Add benchmark utility extension #277

Merged
merged 4 commits into from
May 3, 2024

Conversation

charleskawczynski (Member) commented May 2, 2024

This PR adds a benchmark utility. Users can call it with CTS.benchmark_step(integrator::CTS.DistributedODEIntegrator, device::ClimaComms.AbstractDevice). It works on both CPU and GPU; the GPU path also prints the output of CUDA's @profile, which reports per-kernel details such as the threads, blocks, and registers used. The output looks like:

[ Info: (#)x entries have been multiplied by corresponding factors in order to compute percentages
┌─────────────────────┬────────────┬──────────┬────────────┬────────────┬────────────┬────────────┬───────────┬──────────────────┐
│ Function            │     Memory │   allocs │       Time │       Time │       Time │       Time │ N-samples │ step! percentage │
│                     │   estimate │ estimate │        min │        max │       mean │     median │           │                  │
├─────────────────────┼────────────┼──────────┼────────────┼────────────┼────────────┼────────────┼───────────┼──────────────────┤
│ Wfact (3x)          │    0 bytes │        0 │  43.508 ms │  59.366 ms │  48.653 ms │  48.098 ms │        10 │          29.4579 │
│ ldiv! (3x)          │    0 bytes │        0 │  11.045 ms │  12.610 ms │  11.510 ms │  11.518 ms │        10 │          7.47847 │
│ T_imp! (3x)         │    0 bytes │        0 │   3.913 ms │   4.719 ms │   4.042 ms │   3.954 ms │        10 │          2.64903 │
│ T_exp_T_lim! (4x)   │    0 bytes │        0 │  50.576 ms │  59.781 ms │  55.508 ms │  56.181 ms │        10 │          34.2435 │
│ lim! (4x)           │    0 bytes │        0 │   0.004 ns │   1.168 μs │ 200.002 ns │ 164.000 ns │        10 │       2.70827e-9 │
│ dss! (4x)           │    0 bytes │        0 │   6.888 ms │  12.564 ms │   8.014 ms │   7.538 ms │        10 │          4.66364 │
│ post_explicit! (3x) │  120 bytes │        9 │   9.724 ms │  15.792 ms │  12.280 ms │  11.730 ms │        10 │          6.58398 │
│ post_implicit! (4x) │  160 bytes │       12 │  12.793 ms │  14.665 ms │  13.484 ms │  13.276 ms │        10 │          8.66139 │
│ step! (1x)          │ 106.22 KiB │      211 │ 147.696 ms │ 341.617 ms │ 189.741 ms │ 169.543 ms │        10 │            100.0 │
└─────────────────────┴────────────┴──────────┴────────────┴────────────┴────────────┴────────────┴───────────┴──────────────────┘

A good deal of this came from ClimaAtmos' benchmark script, which happened to be pretty general. I added a frequency component, via n_calls_per_step, so that users know immediately what the percentage breakdown is per ClimaODEFunction. cc @cmbengue
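
The step! percentage column can be cross-checked from the table itself. The sketch below is only an illustration: the helper name step_percentage is hypothetical and not part of this PR, and the use of the minimum time is an assumption inferred from the numbers rather than read from the implementation. Dividing each row's minimum time (which, per the Info line, already includes its (#)x factor) by the minimum time of step! appears to reproduce the reported column to within rounding of the displayed values:

```julia
# Minimum times in ms, copied from the "Time min" column of the table above.
# Each value is already multiplied by its per-step call count, as the Info line states.
min_ms = [
    "Wfact" => 43.508,
    "ldiv!" => 11.045,
    "T_imp!" => 3.913,
    "T_exp_T_lim!" => 50.576,
    "dss!" => 6.888,
    "post_explicit!" => 9.724,
    "post_implicit!" => 12.793,
]
step_min_ms = 147.696  # "Time min" of the step! row

# Hypothetical helper (not part of the PR): share of one step!, in percent.
step_percentage(t_ms, t_step_ms) = 100 * t_ms / t_step_ms

percentages = [name => step_percentage(t, step_min_ms) for (name, t) in min_ms]

for (name, p) in percentages
    println(rpad(name, 16), round(p; digits = 4))
end
```

Because the inputs here are the rounded values shown in the table, the last digits can differ slightly from the reported percentages.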

GPU output coming...

charleskawczynski (Member, Author) commented

On the GPU, we get a lot more information from CUDA's @profile:

--------------- Benchmarking Wfact... Profile for Wfact:
Profiler ran for 19.44 ms, capturing 34 events.

Host-side activity: calling CUDA APIs took 154.6 µs (0.80% of the trace)
┌────┬──────────┬─────────┬────────────────┐
│ ID │    Start │    Time │ Name           │
├────┼──────────┼─────────┼────────────────┤
│  2 │  25.6 µs │ 38.7 µs │ cuLaunchKernel │
│  4 │  73.2 µs │  9.9 µs │ cuLaunchKernel │
│  6 │  92.7 µs │ 14.1 µs │ cuLaunchKernel │
│  8 │ 120.1 µs │ 17.7 µs │ cuLaunchKernel │
│ 10 │ 143.3 µs │ 10.5 µs │ cuLaunchKernel │
│ 12 │ 160.5 µs │  9.3 µs │ cuLaunchKernel │
│ 14 │ 181.4 µs │  9.9 µs │ cuLaunchKernel │
│ 16 │ 201.4 µs │ 15.1 µs │ cuLaunchKernel │
│ 18 │ 221.9 µs │  7.9 µs │ cuLaunchKernel │
│ 20 │ 236.1 µs │  9.4 µs │ cuLaunchKernel │
│ 22 │ 252.3 µs │ 10.8 µs │ cuLaunchKernel │
└────┴──────────┴─────────┴────────────────┘

Device-side activity: GPU was busy for 18.78 ms (96.60% of the trace)
┌────┬───────────┬───────────┬─────────┬────────┬──────┬───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│ ID │     Start │      Time │ Threads │ Blocks │ Regs │ Name                                                                                                                                         
├────┼───────────┼───────────┼─────────┼────────┼──────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│  2 │ 627.01 µs │ 233.54 µs │  4×4×16 │  216×2 │   21 │ _Z11knl_copyto_5VIJFHI7Float32Li4E13CuDeviceArrayIS0_Li5ELi1EEE11BroadcastedI10VIJFHStyleILi4E50CuArray_Float32__N__CUDA_Mem_DeviceBuffer__w
│  4 │ 863.01 µs │ 650.92 µs │  4×4×16 │  216×2 │   32 │ _Z11knl_copyto_5VIJFHI13BandMatrixRowILin0ELi1E7AdjointI7Float3210AxisTensorIS2_Li1E5TupleI17ContravariantAxisI6_1__2_EE6SArrayIS4_ILi2EES2_
│  6 │   1.52 ms │   1.12 ms │     256 │    270 │   48 │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI13BandMatrixRowI39ClimaCore_Utilities_PlusHalf_Int64___1_Li2E7AdjointI7Float3210AxisTensorIS3_Li1E5T
│  8 │   2.64 ms │   1.11 ms │     256 │    284 │   37 │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI13BandMatrixRowI39ClimaCore_Utilities_PlusHalf_Int64___1_Li2E10AxisTensorI7Float32Li1E5TupleI13Covar
│ 10 │   3.75 ms │   2.09 ms │     256 │    270 │   48 │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI13BandMatrixRowI39ClimaCore_Utilities_PlusHalf_Int64___1_Li2E7AdjointI7Float3210AxisTensorIS3_Li1E5T
│ 12 │   5.85 ms │ 851.88 µs │     256 │    270 │   32 │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI13BandMatrixRowI39ClimaCore_Utilities_PlusHalf_Int64___1_Li2E7AdjointI7Float3210AxisTensorIS3_Li1E5T
│ 14 │    6.7 ms │   1.85 ms │     256 │    270 │   64 │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI13BandMatrixRowI39ClimaCore_Utilities_PlusHalf_Int64___1_Li2E7AdjointI7Float3210AxisTensorIS3_Li1E5T
│ 16 │   8.56 ms │   3.41 ms │     256 │    284 │   50 │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI13BandMatrixRowI39ClimaCore_Utilities_PlusHalf_Int64___1_Li2E10AxisTensorI7Float32Li1E5TupleI13Covar
│ 18 │  11.96 ms │   1.65 ms │     256 │    284 │   33 │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI13BandMatrixRowI39ClimaCore_Utilities_PlusHalf_Int64___1_Li2E10AxisTensorI7Float32Li1E5TupleI13Covar
│ 20 │  13.61 ms │   2.86 ms │     256 │    284 │   48 │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI13BandMatrixRowI39ClimaCore_Utilities_PlusHalf_Int64___1_Li2E10AxisTensorI7Float32Li2E5TupleI13Covar
│ 22 │  16.48 ms │   2.96 ms │     256 │    284 │   48 │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI13BandMatrixRowILin1ELi3E10AxisTensorI7Float32Li2E5TupleI13CovariantAxisI4_3__E17ContravariantAxisI4
└────┴───────────┴───────────┴─────────┴────────┴──────┴───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
                                                                                                                                                                                       1 column omitted


--------------- Benchmarking ldiv!... Profile for ldiv!:
Profiler ran for 19.29 ms, capturing 28 events.

Host-side activity: calling CUDA APIs took 113.2 µs (0.59% of the trace)
┌────┬──────────┬─────────┬────────────────┐
│ ID │    Start │    Time │ Name           │
├────┼──────────┼─────────┼────────────────┤
│  2 │  24.6 µs │ 32.4 µs │ cuLaunchKernel │
│  4 │  68.2 µs │ 15.6 µs │ cuLaunchKernel │
│  6 │  90.1 µs │  6.3 µs │ cuLaunchKernel │
│  8 │ 105.3 µs │ 11.2 µs │ cuLaunchKernel │
│ 10 │ 123.6 µs │  9.8 µs │ cuLaunchKernel │
│ 12 │ 140.1 µs │ 14.4 µs │ cuLaunchKernel │
│ 14 │ 160.4 µs │  8.6 µs │ cuLaunchKernel │
│ 16 │ 174.9 µs │  8.4 µs │ cuLaunchKernel │
│ 18 │ 186.2 µs │  5.3 µs │ cuLaunchKernel │
└────┴──────────┴─────────┴────────────────┘

Device-side activity: GPU was busy for 18.7 ms (96.97% of the trace)
┌────┬───────────┬───────────┬─────────┬────────┬──────┬───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│ ID │     Start │      Time │ Threads │ Blocks │ Regs │ Name                                                                                                                                         
├────┼───────────┼───────────┼─────────┼────────┼──────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│  2 │ 556.75 µs │   1.04 ms │  4×4×16 │  216×2 │   32 │ _Z11knl_copyto_5VIJFHI13BandMatrixRowI39ClimaCore_Utilities_PlusHalf_Int64___1_Li2E10AxisTensorI7Float32Li2E5TupleI13CovariantAxisI4_3__E17C
│  4 │    1.6 ms │   4.99 ms │     256 │    284 │   56 │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI13BandMatrixRowILin1ELi3E10AxisTensorI7Float32Li2E5TupleI13CovariantAxisI4_3__E17ContravariantAxisI4
│  6 │    6.6 ms │ 485.91 µs │  4×4×16 │  216×2 │   27 │ _Z11knl_copyto_5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI6_1__2_EE6SArrayIS2_ILi2EES1_Li1ELi2EEELi4E13CuDeviceArrayIS1_Li5ELi1E
│  8 │   7.09 ms │   2.99 ms │     256 │    284 │   40 │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI4_3__EE6SArrayIS3_ILi1EES2_Li1ELi1EEELi4E8SubArrayIS
│ 10 │  10.08 ms │   1.64 ms │     256 │     14 │   47 │ _Z28multiple_field_solve_kernel_10CUDADevice5TupleIS0_I5FieldI5VIJFHIS0_Li4E13CuDeviceArrayI7Float32Li5ELi1EEE16PlaceholderSpaceEES0_IS1_IS2
│ 12 │  11.72 ms │   2.39 ms │     256 │    284 │   40 │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI4_3__EE6SArrayIS3_ILi1EES2_Li1ELi1EEELi4E8SubArrayIS
│ 14 │  14.11 ms │   3.46 ms │     256 │     14 │   43 │ _Z28multiple_field_solve_kernel_10CUDADevice5TupleIS0_I5FieldI5VIJFHIS0_I10AxisTensorI7Float32Li1ES0_I13CovariantAxisI4_3__EE6SArrayIS0_ILi1
│ 16 │  17.57 ms │ 847.28 µs │     256 │    270 │   40 │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI7Float32Li4E8SubArrayIS1_Li5E13CuDeviceArrayIS1_Li5ELi1EE5TupleI5SliceI5OneToI5Int64EES5_IS6_IS7_EES
│ 18 │  18.43 ms │ 858.03 µs │     256 │    270 │   40 │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI7Float32Li4E8SubArrayIS1_Li5E13CuDeviceArrayIS1_Li5ELi1EE5TupleI5SliceI5OneToI5Int64EES5_IS6_IS7_EES
└────┴───────────┴───────────┴─────────┴────────┴──────┴───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
                                                                                                                                                                                       1 column omitted


--------------- Benchmarking T_imp!... Profile for T_imp!:
Profiler ran for 8.51 ms, capturing 16 events.

Host-side activity: calling CUDA APIs took 78.7 µs (0.92% of the trace)
┌────┬──────────┬─────────┬──────────────────┐
│ ID │    Start │    Time │ Name             │
├────┼──────────┼─────────┼──────────────────┤
│  2 │   2.7 µs │ 32.1 µs │ cuMemsetD32Async │
│  4 │  35.8 µs │  4.4 µs │ cuMemsetD32Async │
│  6 │  66.9 µs │ 20.0 µs │ cuLaunchKernel   │
│  8 │  94.7 µs │ 10.2 µs │ cuLaunchKernel   │
│ 10 │ 114.5 µs │ 11.5 µs │ cuLaunchKernel   │
└────┴──────────┴─────────┴──────────────────┘

Device-side activity: GPU was busy for 8.02 ms (94.28% of the trace)
┌────┬───────────┬───────────┬─────────┬────────┬──────┬─────────────┬─────────────┬───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│ ID │     Start │      Time │ Threads │ Blocks │ Regs │        Size │  Throughput │ Name                                                                                                             
├────┼───────────┼───────────┼─────────┼────────┼──────┼─────────────┼─────────────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│  2 │ 470.45 µs │ 140.03 µs │       - │      - │    - │   1.055 MiB │ 7.355 GiB/s │ [set device memory]
│  4 │ 612.82 µs │  45.57 µs │       - │      - │    - │ 283.500 KiB │ 5.933 GiB/s │ [set device memory]
│  6 │ 660.79 µs │   2.61 ms │     256 │    270 │   49 │           - │           - │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI7Float32Li4E8SubArrayIS1_Li5E13CuDeviceArrayIS1_Li5ELi1EE5TupleI5SliceI5
│  8 │   3.27 ms │   2.69 ms │     256 │    270 │   56 │           - │           - │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI7Float32Li4E8SubArrayIS1_Li5E13CuDeviceArrayIS1_Li5ELi1EE5TupleI5SliceI5
│ 10 │   5.97 ms │   2.53 ms │     256 │    284 │   38 │           - │           - │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI4_3__EE6SArrayIS3_ILi1EE
└────┴───────────┴───────────┴─────────┴────────┴──────┴─────────────┴─────────────┴───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
                                                                                                                                                                                       1 column omitted


--------------- Benchmarking T_exp_T_lim!... Profile for T_exp_T_lim!:
Profiler ran for 34.62 ms, capturing 75 events.

Host-side activity: calling CUDA APIs took 9.44 ms (27.27% of the trace)
┌────┬─────────┬──────────┬─────────────────────┐
│ ID │   Start │     Time │ Name                │
├────┼─────────┼──────────┼─────────────────────┤
│  2 │  3.3 µs │ 989.1 µs │ cuMemsetD32Async    │
│  4 │ 1.01 ms │   7.9 µs │ cuMemsetD32Async    │
│  6 │ 1.04 ms │   3.2 µs │ cuMemsetD32Async    │
│  8 │ 1.04 ms │   2.6 µs │ cuMemsetD32Async    │
│ 10 │ 1.09 ms │  30.1 µs │ cuLaunchKernel      │
│ 12 │ 1.13 ms │  12.8 µs │ cuLaunchKernel      │
│ 14 │ 1.15 ms │  26.3 µs │ cuLaunchKernel      │
│ 16 │ 1.18 ms │   8.7 µs │ cuLaunchKernel      │
│ 18 │ 1.19 ms │  12.1 µs │ cuLaunchKernel      │
│ 20 │ 1.21 ms │  8.21 ms │ cuStreamSynchronize │
│ 22 │ 9.43 ms │   8.9 µs │ cuLaunchKernel      │
│ 24 │ 9.45 ms │   5.9 µs │ cuLaunchKernel      │
│ 26 │ 9.46 ms │   6.7 µs │ cuLaunchKernel      │
│ 28 │ 9.47 ms │   3.7 µs │ cuLaunchKernel      │
│ 30 │ 9.48 ms │   3.0 µs │ cuLaunchKernel      │
│ 32 │ 9.48 ms │   3.7 µs │ cuLaunchKernel      │
│ 34 │ 9.49 ms │  10.9 µs │ cuLaunchKernel      │
│ 36 │ 9.51 ms │   7.4 µs │ cuLaunchKernel      │
│ 38 │ 9.52 ms │  10.7 µs │ cuLaunchKernel      │
│ 40 │ 9.54 ms │   9.3 µs │ cuLaunchKernel      │
│ 42 │ 9.56 ms │   7.4 µs │ cuLaunchKernel      │
│ 44 │ 9.57 ms │   7.8 µs │ cuLaunchKernel      │
│ 46 │ 9.58 ms │  17.5 µs │ cuLaunchKernel      │
│ 48 │ 9.61 ms │  17.3 µs │ cuLaunchKernel      │
│ 50 │ 9.63 ms │  14.0 µs │ cuLaunchKernel      │
└────┴─────────┴──────────┴─────────────────────┘

Device-side activity: GPU was busy for 32.77 ms (94.67% of the trace)
┌────┬──────────┬───────────┬─────────┬────────┬──────┬───────────────────┬─────────────┬─────────────┬────────────────────────────────────────────────────────────────────────────────────────────────
│ ID │    Start │      Time │ Threads │ Blocks │ Regs │        Shared Mem │        Size │  Throughput │ Name                                                                                          
├────┼──────────┼───────────┼─────────┼────────┼──────┼───────────────────┼─────────────┼─────────────┼────────────────────────────────────────────────────────────────────────────────────────────────
│  2 │  1.52 ms │ 135.55 µs │       - │      - │    - │                 - │   1.055 MiB │ 7.598 GiB/s │ [set device memory]
│  4 │  1.66 ms │  44.99 µs │       - │      - │    - │                 - │ 283.500 KiB │ 6.009 GiB/s │ [set device memory]
│  6 │  1.71 ms │  147.2 µs │       - │      - │    - │                 - │   1.055 MiB │ 6.997 GiB/s │ [set device memory]
│  8 │  1.86 ms │  44.03 µs │       - │      - │    - │                 - │ 283.500 KiB │ 6.140 GiB/s │ [set device memory]
│ 10 │   1.9 ms │   1.08 ms │  4×4×16 │  216×2 │   40 │  2.000 KiB static │           - │           - │ _Z23copyto_spectral_kernel_5FieldI5VIJFHI7Float32Li4E8SubArrayIS1_Li5E13CuDeviceArrayIS1_Li5E
│ 12 │  2.99 ms │   1.17 ms │  4×4×16 │  216×2 │   40 │  2.000 KiB static │           - │           - │ _Z23copyto_spectral_kernel_5FieldI5VIJFHI7Float32Li4E8SubArrayIS1_Li5E13CuDeviceArrayIS1_Li5E
│ 14 │  4.16 ms │   1.01 ms │  4×4×16 │  216×2 │   39 │  2.000 KiB static │           - │           - │ _Z23copyto_spectral_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI6_1_
│ 16 │  5.18 ms │   3.34 ms │  4×4×16 │  216×2 │   62 │  9.000 KiB static │           - │           - │ _Z23copyto_spectral_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI9_1_
│ 18 │  8.52 ms │ 898.49 µs │  4×4×16 │  216×2 │   40 │  3.000 KiB static │           - │           - │ _Z23copyto_spectral_kernel_5FieldI5VIJFHI7Float32Li4E13CuDeviceArrayIS1_Li5ELi1EEE16Placehold
│ 22 │  9.65 ms │   1.94 ms │     256 │    203 │  125 │                 - │           - │           - │ _Z21dss_transform_kernel_13CuDeviceArrayI7Float32Li4ELi1EES_IS0_Li5ELi1EE8SubArrayIS0_Li5ES_I
│ 24 │  11.6 ms │ 446.56 µs │     256 │    153 │   53 │                 - │           - │           - │ _Z17dss_local_kernel_13CuDeviceArrayI7Float32Li4ELi1EES_I5TupleI5Int64S2_ELi1ELi1EES_IS2_Li1E
│ 26 │ 12.05 ms │   1.93 ms │     256 │    203 │  117 │                 - │           - │           - │ _Z23dss_untransform_kernel_13CuDeviceArrayI7Float32Li4ELi1EES_IS0_Li5ELi1EE8SubArrayIS0_Li5ES
│ 28 │ 13.98 ms │ 521.25 µs │     256 │    203 │  125 │                 - │           - │           - │ _Z21dss_transform_kernel_13CuDeviceArrayI7Float32Li4ELi1EES_IS0_Li5ELi1EE8SubArrayIS0_Li5ES_I
│ 30 │  14.5 ms │ 163.13 µs │     256 │     51 │   53 │                 - │           - │           - │ _Z17dss_local_kernel_13CuDeviceArrayI7Float32Li4ELi1EES_I5TupleI5Int64S2_ELi1ELi1EES_IS2_Li1E
│ 32 │ 14.67 ms │ 503.17 µs │     256 │    203 │  117 │                 - │           - │           - │ _Z23dss_untransform_kernel_13CuDeviceArrayI7Float32Li4ELi1EES_IS0_Li5ELi1EE8SubArrayIS0_Li5ES
│ 34 │ 15.17 ms │   3.37 ms │  4×4×16 │  216×2 │   62 │  9.000 KiB static │           - │           - │ _Z23copyto_spectral_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI9_1_
│ 36 │ 18.55 ms │  552.0 µs │  4×4×16 │  216×2 │   32 │                 - │           - │           - │ _Z11knl_copyto_5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI6_1__2_EE6SArrayIS2_ILi
│ 38 │  19.1 ms │    1.5 ms │     256 │    284 │   33 │                 - │           - │           - │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI4_3__
│ 40 │  20.6 ms │   1.23 ms │  4×4×16 │  216×2 │   62 │  3.000 KiB static │           - │           - │ _Z23copyto_spectral_kernel_5FieldI5VIJFHI7Float32Li4E8SubArrayIS1_Li5E13CuDeviceArrayIS1_Li5E
│ 42 │ 21.84 ms │ 424.16 µs │  4×4×16 │  216×2 │   31 │  2.000 KiB static │           - │           - │ _Z23copyto_spectral_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI17ContravariantAxisI
│ 44 │ 22.27 ms │   1.11 ms │     256 │    284 │   33 │                 - │           - │           - │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI17ContravariantAxisI6
│ 46 │ 23.39 ms │ 687.04 µs │  4×4×16 │  216×2 │   32 │ 1024 bytes static │           - │           - │ _Z23copyto_spectral_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI17ContravariantAxisI
│ 48 │  24.1 ms │   7.48 ms │     256 │    270 │   78 │                 - │           - │           - │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI6_1__
│ 50 │ 31.58 ms │   3.03 ms │     256 │    284 │   58 │                 - │           - │           - │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI4_3__
└────┴──────────┴───────────┴─────────┴────────┴──────┴───────────────────┴─────────────┴─────────────┴────────────────────────────────────────────────────────────────────────────────────────────────
                                                                                                                                                                                       1 column omitted


--------------- Benchmarking lim!... Profile for lim!:
Profiler ran for 2.1 µs, capturing 1 events.

No host-side activity was recorded.

No device-side activity was recorded.


--------------- Benchmarking dss!... Profile for dss!:
Profiler ran for 8.12 ms, capturing 21 events.

Host-side activity: calling CUDA APIs took 1.53 ms (18.78% of the trace)
┌────┬──────────┬─────────┬─────────────────────┐
│ ID │    Start │    Time │ Name                │
├────┼──────────┼─────────┼─────────────────────┤
│  2 │  10.7 µs │ 99.9 µs │ cuStreamSynchronize │
│  4 │ 185.6 µs │ 1.28 ms │ cuLaunchKernel      │
│  6 │  1.52 ms │ 34.4 µs │ cuLaunchKernel      │
│  8 │   1.6 ms │ 33.6 µs │ cuLaunchKernel      │
│ 10 │  1.67 ms │ 25.8 µs │ cuLaunchKernel      │
│ 12 │  1.71 ms │ 20.8 µs │ cuLaunchKernel      │
│ 14 │  1.76 ms │ 29.3 µs │ cuLaunchKernel      │
└────┴──────────┴─────────┴─────────────────────┘

Device-side activity: GPU was busy for 5.92 ms (72.86% of the trace)
┌────┬─────────┬───────────┬─────────┬────────┬──────┬─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│ ID │   Start │      Time │ Threads │ Blocks │ Regs │ Name                                                                                                                                           
├────┼─────────┼───────────┼─────────┼────────┼──────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│  4 │ 2.18 ms │   2.04 ms │     256 │    203 │  125 │ _Z21dss_transform_kernel_13CuDeviceArrayI7Float32Li4ELi1EES_IS0_Li5ELi1EE8SubArrayIS0_Li5ES_IS0_Li5ELi1EE5TupleI5SliceI5OneToI5Int64EES3_IS4_I
│  6 │ 4.22 ms │ 589.25 µs │     256 │    204 │   53 │ _Z17dss_local_kernel_13CuDeviceArrayI7Float32Li4ELi1EES_I5TupleI5Int64S2_ELi1ELi1EES_IS2_Li1ELi1EES_IS1_IS2_S2_S2_S2_4BoolELi1ELi1EE11Perimete
│  8 │ 4.81 ms │   2.04 ms │     256 │    203 │  117 │ _Z23dss_untransform_kernel_13CuDeviceArrayI7Float32Li4ELi1EES_IS0_Li5ELi1EE8SubArrayIS0_Li5ES_IS0_Li5ELi1EE5TupleI5SliceI5OneToI5Int64EES3_IS4
│ 10 │ 6.86 ms │ 546.81 µs │     256 │    213 │  125 │ _Z21dss_transform_kernel_13CuDeviceArrayI7Float32Li4ELi1EES_IS0_Li5ELi1EE8SubArrayIS0_Li5ES_IS0_Li5ELi1EE5TupleI5SliceI5OneToI5Int64EES3_IS4_I
│ 12 │ 7.41 ms │ 169.09 µs │     256 │     54 │   53 │ _Z17dss_local_kernel_13CuDeviceArrayI7Float32Li4ELi1EES_I5TupleI5Int64S2_ELi1ELi1EES_IS2_Li1ELi1EES_IS1_IS2_S2_S2_S2_4BoolELi1ELi1EE11Perimete
│ 14 │ 7.58 ms │ 529.57 µs │     256 │    213 │  117 │ _Z23dss_untransform_kernel_13CuDeviceArrayI7Float32Li4ELi1EES_IS0_Li5ELi1EE8SubArrayIS0_Li5ES_IS0_Li5ELi1EE5TupleI5SliceI5OneToI5Int64EES3_IS4
└────┴─────────┴───────────┴─────────┴────────┴──────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
                                                                                                                                                                                       1 column omitted


--------------- Benchmarking post_explicit!... Profile for post_explicit!:
Profiler ran for 10.21 ms, capturing 34 events.

Host-side activity: calling CUDA APIs took 1.49 ms (14.64% of the trace)
┌────┬─────────┬──────────┬────────────────┐
│ ID │   Start │     Time │ Name           │
├────┼─────────┼──────────┼────────────────┤
│  2 │ 39.5 µs │ 916.5 µs │ cuLaunchKernel │
│  4 │ 1.02 ms │  67.7 µs │ cuLaunchKernel │
│  6 │ 1.11 ms │  23.2 µs │ cuLaunchKernel │
│  8 │ 1.15 ms │  16.6 µs │ cuLaunchKernel │
│ 10 │ 1.19 ms │  27.8 µs │ cuLaunchKernel │
│ 12 │ 1.23 ms │  17.9 µs │ cuLaunchKernel │
│ 14 │ 1.27 ms │  71.0 µs │ cuLaunchKernel │
│ 16 │ 1.37 ms │  17.5 µs │ cuLaunchKernel │
│ 18 │ 1.39 ms │  12.6 µs │ cuLaunchKernel │
│ 20 │ 1.42 ms │ 275.4 µs │ cuLaunchKernel │
│ 22 │ 1.72 ms │  43.4 µs │ cuLaunchKernel │
└────┴─────────┴──────────┴────────────────┘

Device-side activity: GPU was busy for 8.07 ms (78.99% of the trace)
┌────┬─────────┬───────────┬─────────┬────────┬──────┬─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│ ID │   Start │      Time │ Threads │ Blocks │ Regs │ Name                                                                                                                                           
├────┼─────────┼───────────┼─────────┼────────┼──────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│  2 │ 2.11 ms │ 311.26 µs │  4×4×16 │  216×2 │   29 │ _Z11knl_copyto_5VIJFHI10NamedTupleI9__e_tot__5TupleI7Float32EELi4E13CuDeviceArrayIS2_Li5ELi1EEE11BroadcastedI10VIJFHStyleILi4E50CuArray_Float3
│  4 │ 2.42 ms │   1.77 ms │     256 │    284 │   48 │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI17ContravariantAxisI4_3__EE6SArrayIS3_ILi1EES2_Li1ELi1EEELi4E13CuDevic
│  6 │  4.2 ms │  81.95 µs │     4×4 │    216 │   30 │ _Z11knl_copyto_4IJFHI7Float32Li4E8SubArrayIS0_Li4E13CuDeviceArrayIS0_Li5ELi1EE5TupleI5Int645SliceI5OneToIS4_EES5_IS6_IS4_EE9UnitRangeIS4_ES5_I
│  8 │ 4.29 ms │  33.47 µs │     4×4 │    216 │   23 │ _Z9knl_fill_4IJFHI7Float32Li4E8SubArrayIS0_Li4E13CuDeviceArrayIS0_Li5ELi1EE5TupleI5Int645SliceI5OneToIS4_EES5_IS6_IS4_EE9UnitRangeIS4_ES5_IS6_
│ 10 │ 4.32 ms │  847.2 µs │     256 │    270 │   33 │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI9_1__2__3_EE6SArrayIS3_ILi3EES2_Li1ELi3EEELi4E13CuDevi
│ 12 │ 5.17 ms │ 538.14 µs │  4×4×16 │  216×2 │   33 │ _Z11knl_copyto_5VIJFHI10AxisTensorI7Float32Li1E5TupleI17ContravariantAxisI4_3__EE6SArrayIS2_ILi1EES1_Li1ELi1EEELi4E13CuDeviceArrayIS1_Li5ELi1E
│ 14 │ 5.71 ms │   2.19 ms │     256 │    270 │   72 │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI7Float32Li4E13CuDeviceArrayIS1_Li5ELi1EEE16PlaceholderSpaceE11BroadcastedI22CUDAColumnStencilStyleS3_4
│ 16 │  7.9 ms │  556.9 µs │  4×4×16 │  216×2 │   31 │ _Z11knl_copyto_5VIJFHI8PhaseDryI7Float32ELi4E13CuDeviceArrayIS1_Li5ELi1EEE11BroadcastedI10VIJFHStyleILi4E50CuArray_Float32__N__CUDA_Mem_Device
│ 18 │ 8.46 ms │ 371.55 µs │  4×4×16 │  216×2 │   32 │ _Z11knl_copyto_5VIJFHI7Float32Li4E13CuDeviceArrayIS0_Li5ELi1EEE11BroadcastedI10VIJFHStyleILi4E50CuArray_Float32__N__CUDA_Mem_DeviceBuffer__whe
│ 20 │ 8.84 ms │  380.7 µs │  4×4×16 │  216×2 │   31 │ _Z11knl_copyto_5VIJFHI7Float32Li4E13CuDeviceArrayIS0_Li5ELi1EEE11BroadcastedI10VIJFHStyleILi4E50CuArray_Float32__N__CUDA_Mem_DeviceBuffer__whe
│ 22 │ 9.22 ms │ 982.65 µs │     4×4 │    216 │   58 │ _Z11knl_copyto_4IJFHI10NamedTupleI73__ts___ustar___obukhov_length___buoyancy_flux_____flux_u______flux_h_tot_5TupleI8PhaseDryI7Float32ES3_S3_S
└────┴─────────┴───────────┴─────────┴────────┴──────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
                                                                                                                                                                                       1 column omitted


--------------- Benchmarking post_implicit!... Profile for post_implicit!:
Profiler ran for 9.33 ms, capturing 34 events.

Host-side activity: calling CUDA APIs took 815.5 µs (8.74% of the trace)
┌────┬──────────┬──────────┬────────────────┐
│ ID │    Start │     Time │ Name           │
├────┼──────────┼──────────┼────────────────┤
│  2 │  27.4 µs │ 717.9 µs │ cuLaunchKernel │
│  4 │ 775.0 µs │  15.3 µs │ cuLaunchKernel │
│  6 │ 799.4 µs │   7.4 µs │ cuLaunchKernel │
│  8 │ 814.2 µs │   6.2 µs │ cuLaunchKernel │
│ 10 │ 828.9 µs │  10.9 µs │ cuLaunchKernel │
│ 12 │ 847.3 µs │   7.8 µs │ cuLaunchKernel │
│ 14 │ 863.8 µs │  16.4 µs │ cuLaunchKernel │
│ 16 │ 889.2 µs │   7.0 µs │ cuLaunchKernel │
│ 18 │ 903.0 µs │   5.4 µs │ cuLaunchKernel │
│ 20 │ 914.7 µs │   6.0 µs │ cuLaunchKernel │
│ 22 │ 930.8 µs │  13.2 µs │ cuLaunchKernel │
└────┴──────────┴──────────┴────────────────┘

Device-side activity: GPU was busy for 8.05 ms (86.30% of the trace)
┌────┬─────────┬───────────┬─────────┬────────┬──────┬─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│ ID │   Start │      Time │ Threads │ Blocks │ Regs │ Name                                                                                                                                           
├────┼─────────┼───────────┼─────────┼────────┼──────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│  2 │ 1.21 ms │ 313.57 µs │  4×4×16 │  216×2 │   29 │ _Z11knl_copyto_5VIJFHI10NamedTupleI9__e_tot__5TupleI7Float32EELi4E13CuDeviceArrayIS2_Li5ELi1EEE11BroadcastedI10VIJFHStyleILi4E50CuArray_Float3
│  4 │ 1.52 ms │   1.77 ms │     256 │    284 │   48 │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI17ContravariantAxisI4_3__EE6SArrayIS3_ILi1EES2_Li1ELi1EEELi4E13CuDevic
│  6 │ 3.29 ms │  81.25 µs │     4×4 │    216 │   30 │ _Z11knl_copyto_4IJFHI7Float32Li4E8SubArrayIS0_Li4E13CuDeviceArrayIS0_Li5ELi1EE5TupleI5Int645SliceI5OneToIS4_EES5_IS6_IS4_EE9UnitRangeIS4_ES5_I
│  8 │ 3.38 ms │  33.57 µs │     4×4 │    216 │   23 │ _Z9knl_fill_4IJFHI7Float32Li4E8SubArrayIS0_Li4E13CuDeviceArrayIS0_Li5ELi1EE5TupleI5Int645SliceI5OneToIS4_EES5_IS6_IS4_EE9UnitRangeIS4_ES5_IS6_
│ 10 │ 3.41 ms │ 846.78 µs │     256 │    270 │   33 │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI9_1__2__3_EE6SArrayIS3_ILi3EES2_Li1ELi3EEELi4E13CuDevi
│ 12 │ 4.26 ms │  536.0 µs │  4×4×16 │  216×2 │   33 │ _Z11knl_copyto_5VIJFHI10AxisTensorI7Float32Li1E5TupleI17ContravariantAxisI4_3__EE6SArrayIS2_ILi1EES1_Li1ELi1EEELi4E13CuDeviceArrayIS1_Li5ELi1E
│ 14 │  4.8 ms │   2.19 ms │     256 │    270 │   72 │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI7Float32Li4E13CuDeviceArrayIS1_Li5ELi1EEE16PlaceholderSpaceE11BroadcastedI22CUDAColumnStencilStyleS3_4
│ 16 │ 6.99 ms │ 557.25 µs │  4×4×16 │  216×2 │   31 │ _Z11knl_copyto_5VIJFHI8PhaseDryI7Float32ELi4E13CuDeviceArrayIS1_Li5ELi1EEE11BroadcastedI10VIJFHStyleILi4E50CuArray_Float32__N__CUDA_Mem_Device
│ 18 │ 7.55 ms │ 373.82 µs │  4×4×16 │  216×2 │   32 │ _Z11knl_copyto_5VIJFHI7Float32Li4E13CuDeviceArrayIS0_Li5ELi1EEE11BroadcastedI10VIJFHStyleILi4E50CuArray_Float32__N__CUDA_Mem_DeviceBuffer__whe
│ 20 │ 7.93 ms │ 378.62 µs │  4×4×16 │  216×2 │   31 │ _Z11knl_copyto_5VIJFHI7Float32Li4E13CuDeviceArrayIS0_Li5ELi1EEE11BroadcastedI10VIJFHStyleILi4E50CuArray_Float32__N__CUDA_Mem_DeviceBuffer__whe
│ 22 │ 8.31 ms │ 975.29 µs │     4×4 │    216 │   58 │ _Z11knl_copyto_4IJFHI10NamedTupleI73__ts___ustar___obukhov_length___buoyancy_flux_____flux_u______flux_h_tot_5TupleI8PhaseDryI7Float32ES3_S3_S
└────┴─────────┴───────────┴─────────┴────────┴──────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
                                                                                                                                                                                       1 column omitted


--------------- Benchmarking step!...[ Info: Progress: Completed first step
┌ Info: Progress
│   simulation_time = "30 seconds"
│   n_steps_completed = 3
│   wall_time_per_step = "30 milliseconds, 666 microseconds"
│   wall_time_total = "11 seconds, 40 milliseconds"
│   wall_time_remaining = "10 seconds, 948 milliseconds"
│   wall_time_spent = "92 milliseconds, 8 nanoseconds"
│   percent_complete = "0.8%"
│   sypd = 0.893
│   date_now = 2024-05-02T15:08:59.358
└   estimated_finish_date = 2024-05-02T15:09:10.289
┌ Info: Progress
│   simulation_time = "50 seconds"
│   n_steps_completed = 5
│   wall_time_per_step = "118 milliseconds, 799 microseconds"
│   wall_time_total = "42 seconds, 767 milliseconds"
│   wall_time_remaining = "42 seconds, 173 milliseconds"
│   wall_time_spent = "593 milliseconds, 999 microseconds"
│   percent_complete = "1.4%"
│   sypd = 0.231
│   date_now = 2024-05-02T15:08:59.662
└   estimated_finish_date = 2024-05-02T15:09:41.830
┌ Info: Progress
│   simulation_time = "1 minute, 30 seconds"
│   n_steps_completed = 9
│   wall_time_per_step = "86 milliseconds, 222 microseconds"
│   wall_time_total = "31 seconds, 40 milliseconds"
│   wall_time_remaining = "30 seconds, 264 milliseconds"
│   wall_time_spent = "776 milliseconds, 23 nanoseconds"
│   percent_complete = "2.5%"
│   sypd = 0.318
│   date_now = 2024-05-02T15:08:59.877
└   estimated_finish_date = 2024-05-02T15:09:30.141
 Profile for step!:
Profiler ran for 202.94 ms, capturing 1155 events.

Host-side activity: calling CUDA APIs took 198.29 ms (97.71% of the trace)
┌─────┬───────────┬──────────┬─────────────────────┐
│  ID │     Start │     Time │ Name                │
├─────┼───────────┼──────────┼─────────────────────┤
│   4 │   39.7 µs │ 702.1 µs │ cuLaunchKernel      │
│   8 │  760.8 µs │   6.6 µs │ cuLaunchKernel      │
│  10 │  771.0 µs │  10.8 µs │ cuMemsetD32Async    │
│  12 │  782.4 µs │   3.1 µs │ cuMemsetD32Async    │
│  14 │  786.7 µs │   3.0 µs │ cuMemsetD32Async    │
│  16 │  790.3 µs │   2.4 µs │ cuMemsetD32Async    │
│  18 │  802.5 µs │   7.3 µs │ cuLaunchKernel      │
│  20 │  815.1 µs │   6.6 µs │ cuLaunchKernel      │
│  22 │  828.7 µs │   8.1 µs │ cuLaunchKernel      │
│  24 │  843.0 µs │   6.6 µs │ cuLaunchKernel      │
│  26 │  854.6 µs │   5.8 µs │ cuLaunchKernel      │
│  28 │  863.1 µs │  9.77 ms │ cuStreamSynchronize │
│  30 │  10.66 ms │  16.7 µs │ cuLaunchKernel      │
│  32 │  10.68 ms │   5.3 µs │ cuLaunchKernel      │
│  34 │  10.69 ms │   5.2 µs │ cuLaunchKernel      │
│  36 │   10.7 ms │   3.8 µs │ cuLaunchKernel      │
│  38 │  10.71 ms │   3.0 µs │ cuLaunchKernel      │
│  40 │  10.71 ms │   3.6 µs │ cuLaunchKernel      │
│  42 │  10.72 ms │   7.0 µs │ cuLaunchKernel      │
│  44 │  10.74 ms │  11.3 µs │ cuLaunchKernel      │
│  46 │  10.76 ms │   8.0 µs │ cuLaunchKernel      │
│  48 │  10.77 ms │   6.0 µs │ cuLaunchKernel      │
│  50 │  10.78 ms │   5.6 µs │ cuLaunchKernel      │
│  52 │  10.79 ms │   5.6 µs │ cuLaunchKernel      │
│  54 │  10.81 ms │  38.3 µs │ cuLaunchKernel      │
│  56 │  10.85 ms │  11.4 µs │ cuLaunchKernel      │
│  58 │  10.87 ms │   7.8 µs │ cuLaunchKernel      │
│  62 │  10.91 ms │   6.1 µs │ cuLaunchKernel      │
│  66 │  10.92 ms │   3.6 µs │ cuLaunchKernel      │
│  70 │  10.94 ms │   3.4 µs │ cuLaunchKernel      │
│  74 │  10.95 ms │   3.3 µs │ cuLaunchKernel      │
│  76 │  10.95 ms │ 29.14 ms │ cuStreamSynchronize │
│  78 │  40.12 ms │  32.1 µs │ cuLaunchKernel      │
│  80 │  40.17 ms │   9.0 µs │ cuLaunchKernel      │
│  82 │  40.19 ms │  26.7 µs │ cuLaunchKernel      │
│  84 │  40.22 ms │   6.7 µs │ cuLaunchKernel      │
│  86 │  40.23 ms │   5.2 µs │ cuLaunchKernel      │
│  88 │  40.26 ms │   5.7 µs │ cuLaunchKernel      │
│  92 │  40.29 ms │   6.8 µs │ cuLaunchKernel      │
│  96 │   40.3 ms │   7.7 µs │ cuLaunchKernel      │
│  98 │  40.32 ms │   7.8 µs │ cuLaunchKernel      │
│ 100 │  40.34 ms │   9.9 µs │ cuLaunchKernel      │
│ 102 │  40.36 ms │   7.2 µs │ cuLaunchKernel      │
│ 104 │  40.37 ms │   6.2 µs │ cuLaunchKernel      │
│ 106 │  40.39 ms │  10.4 µs │ cuLaunchKernel      │
│ 108 │  40.41 ms │  53.1 µs │ cuLaunchKernel      │
│ 110 │  40.47 ms │  17.3 µs │ cuLaunchKernel      │
│ 112 │   40.5 ms │   7.3 µs │ cuLaunchKernel      │
│ 114 │  40.51 ms │   6.0 µs │ cuLaunchKernel      │
│ 116 │  40.53 ms │   7.3 µs │ cuLaunchKernel      │
│ 118 │  40.55 ms │  22.0 µs │ cuLaunchKernel      │
│ 120 │  40.59 ms │   8.3 µs │ cuLaunchKernel      │
│ 122 │  40.61 ms │  20.9 µs │ cuLaunchKernel      │
│ 124 │  40.64 ms │  12.1 µs │ cuLaunchKernel      │
│ 126 │  40.65 ms │   9.2 µs │ cuLaunchKernel      │
│ 128 │  40.67 ms │   9.5 µs │ cuLaunchKernel      │
│ 130 │  40.69 ms │  19.6 µs │ cuLaunchKernel      │
│ 132 │  40.74 ms │  13.3 µs │ cuLaunchKernel      │
│ 134 │  40.78 ms │  13.7 µs │ cuLaunchKernel      │
│ 136 │  40.81 ms │   8.6 µs │ cuLaunchKernel      │
│ 138 │  40.82 ms │  10.2 µs │ cuLaunchKernel      │
│ 140 │  40.84 ms │  10.0 µs │ cuLaunchKernel      │
│ 142 │  40.86 ms │  12.8 µs │ cuMemsetD32Async    │
│ 144 │  40.87 ms │   4.7 µs │ cuMemsetD32Async    │
│ 146 │  40.88 ms │  11.9 µs │ cuLaunchKernel      │
│ 148 │  40.91 ms │  11.5 µs │ cuLaunchKernel      │
│ 150 │  40.93 ms │  11.7 µs │ cuLaunchKernel      │
│ 154 │  40.96 ms │   9.2 µs │ cuLaunchKernel      │
│ 158 │  40.97 ms │  15.7 µs │ cuLaunchKernel      │
│ 160 │  41.02 ms │   7.6 µs │ cuLaunchKernel      │
│ 162 │  41.04 ms │  24.7 µs │ cuLaunchKernel      │
│ 164 │  41.07 ms │   7.1 µs │ cuLaunchKernel      │
│ 166 │  41.09 ms │  10.6 µs │ cuLaunchKernel      │
│ 168 │   41.1 ms │   7.4 µs │ cuLaunchKernel      │
│ 170 │  41.12 ms │   8.9 µs │ cuLaunchKernel      │
│ 172 │  41.14 ms │   8.2 µs │ cuLaunchKernel      │
│ 174 │  41.15 ms │   8.7 µs │ cuLaunchKernel      │
│ 176 │  41.17 ms │   9.0 µs │ cuLaunchKernel      │
│ 180 │  41.19 ms │  27.6 µs │ cuLaunchKernel      │
│ 184 │  41.23 ms │   5.6 µs │ cuLaunchKernel      │
│ 188 │  41.26 ms │   8.4 µs │ cuLaunchKernel      │
│ 192 │  41.28 ms │  14.0 µs │ cuLaunchKernel      │
│ 194 │   41.3 ms │   6.7 µs │ cuLaunchKernel      │
│ 196 │  41.31 ms │  10.3 µs │ cuLaunchKernel      │
│ 198 │  41.33 ms │   7.3 µs │ cuLaunchKernel      │
│ 200 │  41.34 ms │   5.3 µs │ cuLaunchKernel      │
│ 202 │  41.35 ms │   9.0 µs │ cuLaunchKernel      │
│ 204 │  41.37 ms │   6.5 µs │ cuLaunchKernel      │
│ 206 │  41.38 ms │  15.0 µs │ cuLaunchKernel      │
│ 208 │   41.4 ms │   6.0 µs │ cuLaunchKernel      │
│ 210 │  41.41 ms │   5.3 µs │ cuLaunchKernel      │
│ 212 │  41.42 ms │   5.8 µs │ cuLaunchKernel      │
│ 214 │  41.43 ms │  11.7 µs │ cuLaunchKernel      │
│ 216 │  41.45 ms │  19.4 µs │ cuMemsetD32Async    │
│ 218 │  41.47 ms │  24.2 µs │ cuMemsetD32Async    │
│ 220 │   41.5 ms │   4.8 µs │ cuMemsetD32Async    │
│ 222 │  41.51 ms │   4.3 µs │ cuMemsetD32Async    │
│ 224 │  41.52 ms │   9.1 µs │ cuLaunchKernel      │
│ 226 │  41.54 ms │   9.0 µs │ cuLaunchKernel      │
│ 228 │  41.56 ms │  10.6 µs │ cuLaunchKernel      │
│ 230 │  41.58 ms │   9.1 µs │ cuLaunchKernel      │
│ 232 │  41.59 ms │   8.7 µs │ cuLaunchKernel      │
│ 234 │  41.61 ms │ 84.99 ms │ cuStreamSynchronize │
│ 236 │ 126.65 ms │  23.8 µs │ cuLaunchKernel      │
│ 238 │ 126.68 ms │  19.0 µs │ cuLaunchKernel      │
│ 240 │  126.7 ms │   5.3 µs │ cuLaunchKernel      │
│ 242 │ 126.71 ms │   3.5 µs │ cuLaunchKernel      │
│ 244 │ 126.72 ms │   2.7 µs │ cuLaunchKernel      │
│ 246 │ 126.72 ms │   3.6 µs │ cuLaunchKernel      │
│ 248 │ 126.74 ms │   6.8 µs │ cuLaunchKernel      │
│ 250 │ 126.75 ms │   5.5 µs │ cuLaunchKernel      │
│ 252 │ 126.76 ms │   7.4 µs │ cuLaunchKernel      │
│ 254 │ 126.78 ms │   6.1 µs │ cuLaunchKernel      │
│ 256 │ 126.79 ms │   5.5 µs │ cuLaunchKernel      │
│ 258 │  126.8 ms │   5.6 µs │ cuLaunchKernel      │
│ 260 │ 126.82 ms │  60.5 µs │ cuLaunchKernel      │
│ 262 │ 126.89 ms │  31.3 µs │ cuLaunchKernel      │
│ 264 │ 126.93 ms │   8.1 µs │ cuLaunchKernel      │
│ 268 │ 126.98 ms │  26.1 µs │ cuLaunchKernel      │
│ 272 │ 127.01 ms │   4.1 µs │ cuLaunchKernel      │
│ 276 │ 127.03 ms │  19.3 µs │ cuLaunchKernel      │
│ 280 │ 127.05 ms │  16.8 µs │ cuLaunchKernel      │
│ 284 │ 127.08 ms │   3.8 µs │ cuLaunchKernel      │
│ 288 │ 127.09 ms │   3.0 µs │ cuLaunchKernel      │
│ 290 │  127.1 ms │ 32.58 ms │ cuStreamSynchronize │
│ 292 │ 159.74 ms │  39.1 µs │ cuLaunchKernel      │
│ 294 │  159.8 ms │   6.5 µs │ cuLaunchKernel      │
│ 296 │ 159.82 ms │   5.7 µs │ cuLaunchKernel      │
│ 298 │ 159.84 ms │   4.2 µs │ cuLaunchKernel      │
│ 300 │ 159.85 ms │   3.0 µs │ cuLaunchKernel      │
│ 302 │ 159.86 ms │   3.9 µs │ cuLaunchKernel      │
│ 306 │ 159.89 ms │   4.8 µs │ cuLaunchKernel      │
│ 310 │  159.9 ms │   3.4 µs │ cuLaunchKernel      │
│ 312 │ 159.93 ms │   5.3 µs │ cuLaunchKernel      │
│ 314 │ 159.96 ms │  26.8 µs │ cuLaunchKernel      │
│ 316 │ 159.99 ms │   4.9 µs │ cuLaunchKernel      │
│ 318 │  160.0 ms │  18.2 µs │ cuLaunchKernel      │
│ 320 │ 160.03 ms │  25.9 µs │ cuLaunchKernel      │
│ 322 │ 160.07 ms │  45.0 µs │ cuLaunchKernel      │
│ 324 │ 160.13 ms │  12.9 µs │ cuLaunchKernel      │
│ 326 │ 160.26 ms │   5.0 µs │ cuLaunchKernel      │
│ 328 │ 160.27 ms │   3.9 µs │ cuLaunchKernel      │
│ 330 │ 160.28 ms │   4.7 µs │ cuLaunchKernel      │
│ 332 │  160.3 ms │  28.3 µs │ cuLaunchKernel      │
│ 334 │ 160.34 ms │  23.0 µs │ cuLaunchKernel      │
│ 336 │ 160.37 ms │  24.3 µs │ cuLaunchKernel      │
│ 338 │  160.4 ms │   8.8 µs │ cuLaunchKernel      │
│ 340 │ 160.43 ms │   6.6 µs │ cuLaunchKernel      │
│ 342 │ 160.44 ms │   6.5 µs │ cuLaunchKernel      │
│ 344 │ 160.45 ms │  33.2 µs │ cuLaunchKernel      │
│ 346 │  160.5 ms │   7.8 µs │ cuLaunchKernel      │
│ 348 │ 160.51 ms │   9.8 µs │ cuLaunchKernel      │
│ 350 │ 160.54 ms │   5.6 µs │ cuLaunchKernel      │
│ 352 │ 160.55 ms │   6.8 µs │ cuLaunchKernel      │
│ 354 │ 160.58 ms │   7.5 µs │ cuLaunchKernel      │
│ 356 │ 160.59 ms │  23.5 µs │ cuMemsetD32Async    │
│ 358 │ 160.62 ms │   2.7 µs │ cuMemsetD32Async    │
│ 360 │ 160.65 ms │   7.4 µs │ cuLaunchKernel      │
│ 362 │ 160.66 ms │  20.5 µs │ cuLaunchKernel      │
│ 364 │ 160.69 ms │  14.4 µs │ cuLaunchKernel      │
│ 368 │ 160.71 ms │   5.1 µs │ cuLaunchKernel      │
│ 372 │ 160.74 ms │  11.0 µs │ cuLaunchKernel      │
│ 374 │ 160.75 ms │  18.8 µs │ cuLaunchKernel      │
│ 376 │ 160.79 ms │   7.5 µs │ cuLaunchKernel      │
│ 378 │  160.8 ms │   4.3 µs │ cuLaunchKernel      │
│ 380 │ 160.81 ms │  20.6 µs │ cuLaunchKernel      │
│ 382 │ 160.84 ms │   5.0 µs │ cuLaunchKernel      │
│ 384 │ 160.85 ms │   6.6 µs │ cuLaunchKernel      │
│ 386 │ 160.87 ms │   5.5 µs │ cuLaunchKernel      │
│ 388 │ 160.88 ms │   5.9 µs │ cuLaunchKernel      │
│ 390 │ 160.89 ms │   5.2 µs │ cuLaunchKernel      │
│ 394 │  160.9 ms │  18.9 µs │ cuLaunchKernel      │
│ 398 │ 160.93 ms │  16.4 µs │ cuLaunchKernel      │
│ 402 │ 160.95 ms │  18.0 µs │ cuLaunchKernel      │
│ 406 │ 160.97 ms │   9.3 µs │ cuLaunchKernel      │
│ 408 │ 160.99 ms │   4.7 µs │ cuLaunchKernel      │
│ 410 │ 161.01 ms │   6.1 µs │ cuLaunchKernel      │
│ 412 │ 161.03 ms │   4.2 µs │ cuLaunchKernel      │
│ 414 │ 161.05 ms │   3.2 µs │ cuLaunchKernel      │
│ 416 │ 161.06 ms │   6.4 µs │ cuLaunchKernel      │
│ 418 │ 161.07 ms │   4.4 µs │ cuLaunchKernel      │
│ 420 │ 161.08 ms │  10.5 µs │ cuLaunchKernel      │
│ 422 │ 161.09 ms │   3.7 µs │ cuLaunchKernel      │
│ 424 │  161.1 ms │   3.3 µs │ cuLaunchKernel      │
│ 426 │ 161.11 ms │   3.5 µs │ cuLaunchKernel      │
│ 428 │ 161.11 ms │   8.6 µs │ cuLaunchKernel      │
│ 430 │ 161.13 ms │  12.8 µs │ cuMemsetD32Async    │
│ 432 │ 161.14 ms │   3.7 µs │ cuMemsetD32Async    │
│ 434 │ 161.14 ms │   2.5 µs │ cuMemsetD32Async    │
│ 436 │ 161.15 ms │   2.4 µs │ cuMemsetD32Async    │
│ 438 │ 161.16 ms │   6.1 µs │ cuLaunchKernel      │
│ 440 │ 161.19 ms │   6.4 µs │ cuLaunchKernel      │
│ 442 │  161.2 ms │   7.2 µs │ cuLaunchKernel      │
│ 444 │ 161.21 ms │   6.3 µs │ cuLaunchKernel      │
│ 446 │ 161.22 ms │   5.8 µs │ cuLaunchKernel      │
│ 448 │ 161.23 ms │  22.2 ms │ cuStreamSynchronize │
│ 450 │ 183.45 ms │  20.9 µs │ cuLaunchKernel      │
│ 452 │ 183.48 ms │   5.4 µs │ cuLaunchKernel      │
│ 454 │ 183.49 ms │   5.8 µs │ cuLaunchKernel      │
│ 456 │  183.5 ms │   3.9 µs │ cuLaunchKernel      │
│ 458 │ 183.51 ms │   2.9 µs │ cuLaunchKernel      │
│ 460 │ 183.52 ms │   3.6 µs │ cuLaunchKernel      │
│ 462 │ 183.53 ms │   6.6 µs │ cuLaunchKernel      │
│ 464 │ 183.54 ms │   5.6 µs │ cuLaunchKernel      │
│ 466 │ 183.55 ms │   7.7 µs │ cuLaunchKernel      │
│ 468 │ 183.57 ms │   5.9 µs │ cuLaunchKernel      │
│ 470 │  183.6 ms │   6.6 µs │ cuLaunchKernel      │
│ 472 │ 183.61 ms │   5.7 µs │ cuLaunchKernel      │
│ 474 │ 183.62 ms │  20.2 µs │ cuLaunchKernel      │
│ 476 │ 183.65 ms │  10.8 µs │ cuLaunchKernel      │
│ 478 │ 183.67 ms │   7.5 µs │ cuLaunchKernel      │
│ 482 │ 183.69 ms │   8.9 µs │ cuLaunchKernel      │
│ 486 │ 183.71 ms │   4.2 µs │ cuLaunchKernel      │
│ 490 │ 183.72 ms │   4.2 µs │ cuLaunchKernel      │
│ 494 │ 183.73 ms │   3.9 µs │ cuLaunchKernel      │
│ 498 │ 183.75 ms │   4.5 µs │ cuLaunchKernel      │
│ 502 │ 183.76 ms │   3.6 µs │ cuLaunchKernel      │
│ 504 │ 183.76 ms │  3.59 ms │ cuStreamSynchronize │
│ 506 │ 187.36 ms │   5.8 µs │ cuLaunchKernel      │
│ 508 │ 187.38 ms │   3.6 µs │ cuLaunchKernel      │
│ 510 │ 187.38 ms │   3.9 µs │ cuLaunchKernel      │
│ 512 │ 187.39 ms │   3.7 µs │ cuLaunchKernel      │
│ 514 │  187.4 ms │   2.7 µs │ cuLaunchKernel      │
│ 516 │  187.4 ms │   3.6 µs │ cuLaunchKernel      │
│ 520 │ 187.42 ms │   3.9 µs │ cuLaunchKernel      │
│ 524 │ 187.43 ms │   3.0 µs │ cuLaunchKernel      │
│ 526 │ 187.43 ms │   4.6 µs │ cuLaunchKernel      │
│ 528 │ 187.45 ms │   9.7 µs │ cuLaunchKernel      │
│ 530 │ 187.46 ms │   5.0 µs │ cuLaunchKernel      │
│ 532 │ 187.47 ms │   3.6 µs │ cuLaunchKernel      │
│ 534 │ 187.48 ms │   6.6 µs │ cuLaunchKernel      │
│ 536 │ 187.49 ms │  11.3 µs │ cuLaunchKernel      │
│ 538 │ 187.51 ms │  11.8 µs │ cuLaunchKernel      │
│ 540 │ 187.53 ms │   4.3 µs │ cuLaunchKernel      │
│ 542 │ 187.54 ms │   3.4 µs │ cuLaunchKernel      │
│ 544 │ 187.54 ms │   4.1 µs │ cuLaunchKernel      │
│ 546 │ 187.55 ms │   9.1 µs │ cuLaunchKernel      │
│ 548 │ 187.59 ms │   4.5 µs │ cuLaunchKernel      │
│ 550 │  187.6 ms │   5.1 µs │ cuLaunchKernel      │
│ 552 │ 187.61 ms │   8.3 µs │ cuLaunchKernel      │
│ 554 │ 187.62 ms │   6.4 µs │ cuLaunchKernel      │
│ 556 │ 187.63 ms │   6.3 µs │ cuLaunchKernel      │
│ 558 │ 187.65 ms │  12.2 µs │ cuLaunchKernel      │
│ 560 │ 187.67 ms │   7.3 µs │ cuLaunchKernel      │
│ 562 │ 187.68 ms │   8.9 µs │ cuLaunchKernel      │
│ 564 │  187.7 ms │   5.5 µs │ cuLaunchKernel      │
│ 566 │ 187.71 ms │   7.0 µs │ cuLaunchKernel      │
│ 568 │ 187.72 ms │   7.0 µs │ cuLaunchKernel      │
│ 570 │ 187.73 ms │   9.0 µs │ cuMemsetD32Async    │
│ 572 │ 187.74 ms │   2.8 µs │ cuMemsetD32Async    │
│ 574 │ 187.75 ms │   6.4 µs │ cuLaunchKernel      │
│ 576 │ 187.76 ms │   9.9 µs │ cuLaunchKernel      │
│ 578 │ 187.78 ms │  17.3 µs │ cuLaunchKernel      │
│ 582 │ 187.81 ms │   4.6 µs │ cuLaunchKernel      │
│ 586 │ 187.82 ms │   9.9 µs │ cuLaunchKernel      │
│ 588 │ 187.83 ms │   5.4 µs │ cuLaunchKernel      │
│ 590 │ 187.84 ms │   7.3 µs │ cuLaunchKernel      │
│ 592 │ 187.86 ms │   4.5 µs │ cuLaunchKernel      │
│ 594 │ 187.87 ms │   7.1 µs │ cuLaunchKernel      │
│ 596 │ 187.88 ms │   4.7 µs │ cuLaunchKernel      │
│ 598 │ 187.89 ms │   6.5 µs │ cuLaunchKernel      │
│ 600 │  187.9 ms │   5.5 µs │ cuLaunchKernel      │
│ 602 │ 187.91 ms │   6.2 µs │ cuLaunchKernel      │
│ 604 │ 187.92 ms │   5.2 µs │ cuLaunchKernel      │
│ 608 │ 187.93 ms │   4.4 µs │ cuLaunchKernel      │
│ 612 │ 187.94 ms │   3.2 µs │ cuLaunchKernel      │
│ 616 │ 187.96 ms │   4.1 µs │ cuLaunchKernel      │
│ 620 │ 187.97 ms │   8.7 µs │ cuLaunchKernel      │
│ 622 │ 187.98 ms │   4.2 µs │ cuLaunchKernel      │
│ 624 │ 187.99 ms │   6.7 µs │ cuLaunchKernel      │
│ 626 │  188.0 ms │   5.0 µs │ cuLaunchKernel      │
│ 628 │ 188.01 ms │   3.0 µs │ cuLaunchKernel      │
│ 630 │ 188.01 ms │   6.1 µs │ cuLaunchKernel      │
│ 632 │ 188.02 ms │   4.3 µs │ cuLaunchKernel      │
│ 634 │ 188.03 ms │  10.8 µs │ cuLaunchKernel      │
│ 636 │ 188.05 ms │   3.9 µs │ cuLaunchKernel      │
│ 638 │ 188.06 ms │   3.2 µs │ cuLaunchKernel      │
│ 640 │ 188.06 ms │   4.1 µs │ cuLaunchKernel      │
│ 642 │ 188.07 ms │   8.8 µs │ cuLaunchKernel      │
│ 644 │ 188.08 ms │  12.0 µs │ cuMemsetD32Async    │
│ 646 │  188.1 ms │   3.4 µs │ cuMemsetD32Async    │
│ 648 │  188.1 ms │   2.8 µs │ cuMemsetD32Async    │
│ 650 │  188.1 ms │   2.5 µs │ cuMemsetD32Async    │
│ 652 │ 188.11 ms │   5.9 µs │ cuLaunchKernel      │
│ 654 │ 188.12 ms │   8.9 µs │ cuLaunchKernel      │
│ 656 │ 188.14 ms │   6.9 µs │ cuLaunchKernel      │
│ 658 │ 188.15 ms │   6.5 µs │ cuLaunchKernel      │
│ 660 │ 188.16 ms │   5.4 µs │ cuLaunchKernel      │
│ 662 │ 188.17 ms │  8.35 ms │ cuStreamSynchronize │
│ 664 │ 196.53 ms │  13.4 µs │ cuLaunchKernel      │
│ 666 │ 196.55 ms │   4.5 µs │ cuLaunchKernel      │
│ 668 │ 196.56 ms │   4.8 µs │ cuLaunchKernel      │
│ 670 │ 196.57 ms │   3.7 µs │ cuLaunchKernel      │
│ 672 │ 196.57 ms │   2.9 µs │ cuLaunchKernel      │
│ 674 │ 196.58 ms │   3.6 µs │ cuLaunchKernel      │
│ 676 │ 196.61 ms │   7.1 µs │ cuLaunchKernel      │
│ 678 │ 196.62 ms │  19.7 µs │ cuLaunchKernel      │
│ 680 │ 196.66 ms │   7.4 µs │ cuLaunchKernel      │
│ 682 │ 196.67 ms │   6.0 µs │ cuLaunchKernel      │
│ 684 │ 196.68 ms │   5.4 µs │ cuLaunchKernel      │
│ 686 │ 196.69 ms │   5.7 µs │ cuLaunchKernel      │
│ 688 │  196.7 ms │  16.8 µs │ cuLaunchKernel      │
│ 690 │ 196.73 ms │  10.6 µs │ cuLaunchKernel      │
│ 692 │ 196.75 ms │   7.0 µs │ cuLaunchKernel      │
│ 696 │ 196.76 ms │   5.4 µs │ cuLaunchKernel      │
│ 700 │ 196.78 ms │   4.1 µs │ cuLaunchKernel      │
│ 704 │ 196.79 ms │   3.6 µs │ cuLaunchKernel      │
│ 708 │  196.8 ms │   3.1 µs │ cuLaunchKernel      │
│ 712 │ 196.81 ms │   4.1 µs │ cuLaunchKernel      │
│ 716 │ 196.82 ms │   4.0 µs │ cuLaunchKernel      │
│ 720 │ 196.83 ms │   4.0 µs │ cuLaunchKernel      │
│ 724 │ 196.84 ms │   3.9 µs │ cuLaunchKernel      │
│ 726 │ 196.85 ms │   3.8 ms │ cuStreamSynchronize │
│ 728 │ 200.66 ms │   5.9 µs │ cuLaunchKernel      │
│ 730 │ 200.66 ms │   3.4 µs │ cuLaunchKernel      │
│ 732 │ 200.67 ms │   7.5 µs │ cuLaunchKernel      │
│ 734 │ 200.68 ms │   3.8 µs │ cuLaunchKernel      │
│ 736 │ 200.69 ms │   2.8 µs │ cuLaunchKernel      │
│ 738 │ 200.69 ms │   3.5 µs │ cuLaunchKernel      │
│ 740 │  200.7 ms │   4.1 µs │ cuLaunchKernel      │
│ 742 │ 200.71 ms │   6.5 µs │ cuLaunchKernel      │
│ 744 │ 200.72 ms │   4.3 µs │ cuLaunchKernel      │
│ 746 │ 200.73 ms │   3.3 µs │ cuLaunchKernel      │
│ 748 │ 200.74 ms │   6.3 µs │ cuLaunchKernel      │
│ 750 │ 200.75 ms │   4.6 µs │ cuLaunchKernel      │
│ 752 │ 200.76 ms │  17.4 µs │ cuLaunchKernel      │
│ 754 │ 200.78 ms │   5.0 µs │ cuLaunchKernel      │
│ 756 │ 200.79 ms │   3.6 µs │ cuLaunchKernel      │
│ 758 │  200.8 ms │   4.1 µs │ cuLaunchKernel      │
│ 760 │ 200.81 ms │  10.6 µs │ cuLaunchKernel      │
│ 762 │ 201.17 ms │  11.2 µs │ cuLaunchKernel      │
│ 764 │  201.2 ms │   5.4 µs │ cuLaunchKernel      │
│ 766 │ 201.22 ms │   5.4 µs │ cuLaunchKernel      │
│ 768 │ 201.25 ms │   8.3 µs │ cuLaunchKernel      │
│ 770 │ 201.26 ms │   5.1 µs │ cuLaunchKernel      │
│ 772 │ 201.28 ms │   3.3 µs │ cuLaunchKernel      │
│ 774 │ 201.29 ms │   6.0 µs │ cuLaunchKernel      │
│ 776 │ 201.31 ms │   3.4 µs │ cuLaunchKernel      │
│ 778 │ 201.32 ms │   4.8 µs │ cuLaunchKernel      │
│ 780 │ 201.33 ms │  13.1 µs │ cuLaunchKernel      │
│ 782 │ 201.36 ms │  16.9 µs │ cuLaunchKernel      │
│ 784 │ 201.38 ms │   3.9 µs │ cuLaunchKernel      │
│ 786 │  201.4 ms │   6.1 µs │ cuLaunchKernel      │
│ 788 │ 201.41 ms │   3.4 µs │ cuLaunchKernel      │
│ 790 │ 201.42 ms │   5.6 µs │ cuLaunchKernel      │
│ 792 │ 201.44 ms │   3.3 µs │ cuLaunchKernel      │
│ 794 │ 201.47 ms │   6.2 µs │ cuLaunchKernel      │
│ 796 │ 201.48 ms │   3.3 µs │ cuLaunchKernel      │
│ 798 │  201.5 ms │   6.1 µs │ cuLaunchKernel      │
│ 800 │ 201.51 ms │   3.4 µs │ cuLaunchKernel      │
│ 802 │ 201.52 ms │   3.7 µs │ cuLaunchKernel      │
│ 804 │ 201.54 ms │   4.8 µs │ cuLaunchKernel      │
│ 806 │ 201.55 ms │   3.6 µs │ cuLaunchKernel      │
│ 808 │ 201.56 ms │   4.4 µs │ cuLaunchKernel      │
└─────┴───────────┴──────────┴─────────────────────┘

Device-side activity: GPU was busy for 197.98 ms (97.56% of the trace)
┌─────┬───────────┬───────────┬─────────┬────────┬──────┬───────────────────┬─────────────┬──────────────┬─────────────────────────────────────────────────────────────────────────────────────────────
│  ID │     Start │      Time │ Threads │ Blocks │ Regs │        Shared Mem │        Size │   Throughput │ Name                                                                                       
├─────┼───────────┼───────────┼─────────┼────────┼──────┼───────────────────┼─────────────┼──────────────┼─────────────────────────────────────────────────────────────────────────────────────────────
│   41.13 ms │   1.51 ms │    1024564--- │ _Z3_3515CuKernelContext13CuDeviceArrayI7Float32Li5ELi1EE11BroadcastedI12CuArrayStyleILi5E1 82.64 ms │ 426.59 µs │    1024564--- │ _Z3_3515CuKernelContext13CuDeviceArrayI7Float32Li5ELi1EE11BroadcastedI12CuArrayStyleILi5E1 103.07 ms │ 143.68 µs │       ----1.055 MiB │  7.168 GiB/s │ [set device memory]                                                                        123.22 ms │  44.48 µs │       ----283.500 KiB │  6.079 GiB/s │ [set device memory]                                                                        143.26 ms │ 145.82 µs │       ----1.055 MiB │  7.063 GiB/s │ [set device memory]                                                                        163.41 ms │  42.78 µs │       ----283.500 KiB │  6.319 GiB/s │ [set device memory]                                                                        183.46 ms │   1.04 ms │  4×4×16216×2402.000 KiB static │           -- │ _Z23copyto_spectral_kernel_5FieldI5VIJFHI7Float32Li4E8SubArrayIS1_Li5E13CuDeviceArrayIS1_L 204.5 ms │   1.11 ms │  4×4×16216×2402.000 KiB static │           -- │ _Z23copyto_spectral_kernel_5FieldI5VIJFHI7Float32Li4E8SubArrayIS1_Li5E13CuDeviceArrayIS1_L 225.61 ms │ 970.46 µs │  4×4×16216×2392.000 KiB static │           -- │ _Z23copyto_spectral_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI6 246.59 ms │   3.15 ms │  4×4×16216×2629.000 KiB static │           -- │ _Z23copyto_spectral_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI9 269.73 ms │ 893.63 µs │  4×4×16216×2403.000 KiB static │           -- │ _Z23copyto_spectral_kernel_5FieldI5VIJFHI7Float32Li4E13CuDeviceArrayIS1_Li5ELi1EEE16Placeh 3010.96 ms │   1.93 ms │     256203125--- │ _Z21dss_transform_kernel_13CuDeviceArrayI7Float32Li4ELi1EES_IS0_Li5ELi1EE8SubArrayIS0_Li5E 3212.9 ms │ 447.61 µs │     25615353--- │ _Z17dss_local_kernel_13CuDeviceArrayI7Float32Li4ELi1EES_I5TupleI5Int64S2_ELi1ELi1EES_IS2_L 3413.35 ms │   1.92 ms │     256203117--- 
│ _Z23dss_untransform_kernel_13CuDeviceArrayI7Float32Li4ELi1EES_IS0_Li5ELi1EE8SubArrayIS0_Li 3615.27 ms │ 520.19 µs │     256203125--- │ _Z21dss_transform_kernel_13CuDeviceArrayI7Float32Li4ELi1EES_IS0_Li5ELi1EE8SubArrayIS0_Li5E 3815.8 ms │ 161.15 µs │     2565153--- │ _Z17dss_local_kernel_13CuDeviceArrayI7Float32Li4ELi1EES_I5TupleI5Int64S2_ELi1ELi1EES_IS2_L 4015.96 ms │ 502.52 µs │     256203117--- │ _Z23dss_untransform_kernel_13CuDeviceArrayI7Float32Li4ELi1EES_IS0_Li5ELi1EE8SubArrayIS0_Li 4216.46 ms │   3.17 ms │  4×4×16216×2629.000 KiB static │           -- │ _Z23copyto_spectral_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI9 4419.63 ms │ 548.96 µs │  4×4×16216×232--- │ _Z11knl_copyto_5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI6_1__2_EE6SArrayIS2_ 4620.19 ms │   1.47 ms │     25628433--- │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI4_ 4821.66 ms │   1.23 ms │  4×4×16216×2623.000 KiB static │           -- │ _Z23copyto_spectral_kernel_5FieldI5VIJFHI7Float32Li4E8SubArrayIS1_Li5E13CuDeviceArrayIS1_L 5022.9 ms │ 424.28 µs │  4×4×16216×2312.000 KiB static │           -- │ _Z23copyto_spectral_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI17ContravariantAx 5223.32 ms │   1.12 ms │     25628433--- │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI17ContravariantAxi 5424.44 ms │  687.0 µs │  4×4×16216×2321024 bytes static │           -- │ _Z23copyto_spectral_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI17ContravariantAx 5625.15 ms │   7.15 ms │     25627078--- │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI6_ 5832.29 ms │   2.97 ms │     25628458--- │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI4_ 6235.27 ms │   1.89 ms │     768577--- │ _Z3_3515CuKernelContext13CuDeviceArrayI7Float32Li5ELi1EE11BroadcastedI12CuArrayStyleILi5E1 6637.16 ms │ 521.79 µs │     768577--- │ 
_Z3_3515CuKernelContext13CuDeviceArrayI7Float32Li5ELi1EE11BroadcastedI12CuArrayStyleILi5E1 7037.68 ms │   1.87 ms │     768577--- │ _Z3_3515CuKernelContext13CuDeviceArrayI7Float32Li5ELi1EE11BroadcastedI12CuArrayStyleILi5E1 7439.56 ms │  520.7 µs │     768577--- │ _Z3_3515CuKernelContext13CuDeviceArrayI7Float32Li5ELi1EE11BroadcastedI12CuArrayStyleILi5E1 7840.69 ms │   2.06 ms │     256203125--- │ _Z21dss_transform_kernel_13CuDeviceArrayI7Float32Li4ELi1EES_IS0_Li5ELi1EE8SubArrayIS0_Li5E 8042.76 ms │ 589.76 µs │     25620453--- │ _Z17dss_local_kernel_13CuDeviceArrayI7Float32Li4ELi1EES_I5TupleI5Int64S2_ELi1ELi1EES_IS2_L 8243.35 ms │   2.04 ms │     256203117--- │ _Z23dss_untransform_kernel_13CuDeviceArrayI7Float32Li4ELi1EES_IS0_Li5ELi1EE8SubArrayIS0_Li 8445.39 ms │ 547.97 µs │     256213125--- │ _Z21dss_transform_kernel_13CuDeviceArrayI7Float32Li4ELi1EES_IS0_Li5ELi1EE8SubArrayIS0_Li5E 8645.94 ms │ 170.05 µs │     2565453--- │ _Z17dss_local_kernel_13CuDeviceArrayI7Float32Li4ELi1EES_I5TupleI5Int64S2_ELi1ELi1EES_IS2_L 8846.12 ms │ 529.76 µs │     256213117--- │ _Z23dss_untransform_kernel_13CuDeviceArrayI7Float32Li4ELi1EES_IS0_Li5ELi1EE8SubArrayIS0_Li 9246.65 ms │   1.51 ms │    1024564--- │ _Z3_3515CuKernelContext13CuDeviceArrayI7Float32Li5ELi1EE11BroadcastedI12CuArrayStyleILi5E1 9648.16 ms │ 429.02 µs │    1024564--- │ _Z3_3515CuKernelContext13CuDeviceArrayI7Float32Li5ELi1EE11BroadcastedI12CuArrayStyleILi5E1 9848.59 ms │  314.3 µs │  4×4×16216×229--- │ _Z11knl_copyto_5VIJFHI10NamedTupleI9__e_tot__5TupleI7Float32EELi4E13CuDeviceArrayIS2_Li5EL 10048.91 ms │   1.77 ms │     25628448--- │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI17ContravariantAxi 10250.67 ms │  81.57 µs │     4×421630--- │ _Z11knl_copyto_4IJFHI7Float32Li4E8SubArrayIS0_Li4E13CuDeviceArrayIS0_Li5ELi1EE5TupleI5Int6 10450.76 ms │   33.5 µs │     4×421623--- │ _Z9knl_fill_4IJFHI7Float32Li4E8SubArrayIS0_Li4E13CuDeviceArrayIS0_Li5ELi1EE5TupleI5Int645S 10650.79 ms │ 857.92 µs │     
25627033--- │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI9_ 10851.65 ms │ 536.45 µs │  4×4×16216×233--- │ _Z11knl_copyto_5VIJFHI10AxisTensorI7Float32Li1E5TupleI17ContravariantAxisI4_3__EE6SArrayIS 11052.22 ms │   2.18 ms │     25627072--- │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI7Float32Li4E13CuDeviceArrayIS1_Li5ELi1EEE16Placeho 11254.4 ms │ 557.56 µs │  4×4×16216×231--- │ _Z11knl_copyto_5VIJFHI8PhaseDryI7Float32ELi4E13CuDeviceArrayIS1_Li5ELi1EEE11BroadcastedI10 11454.96 ms │ 374.01 µs │  4×4×16216×232--- │ _Z11knl_copyto_5VIJFHI7Float32Li4E13CuDeviceArrayIS0_Li5ELi1EEE11BroadcastedI10VIJFHStyleI 11655.34 ms │  378.3 µs │  4×4×16216×231--- │ _Z11knl_copyto_5VIJFHI7Float32Li4E13CuDeviceArrayIS0_Li5ELi1EEE11BroadcastedI10VIJFHStyleI 11855.72 ms │ 971.74 µs │     4×421658--- │ _Z11knl_copyto_4IJFHI10NamedTupleI73__ts___ustar___obukhov_length___buoyancy_flux_____flux 12056.7 ms │ 232.16 µs │  4×4×16216×221--- │ _Z11knl_copyto_5VIJFHI7Float32Li4E13CuDeviceArrayIS0_Li5ELi1EEE11BroadcastedI10VIJFHStyleI 12256.93 ms │ 657.21 µs │  4×4×16216×232--- │ _Z11knl_copyto_5VIJFHI13BandMatrixRowILin0ELi1E7AdjointI7Float3210AxisTensorIS2_Li1E5Tuple 12457.59 ms │   1.13 ms │     25627048--- │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI13BandMatrixRowI39ClimaCore_Utilities_PlusHalf_Int 12658.73 ms │   1.11 ms │     25628437--- │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI13BandMatrixRowI39ClimaCore_Utilities_PlusHalf_Int 12859.84 ms │   2.11 ms │     25627048--- │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI13BandMatrixRowI39ClimaCore_Utilities_PlusHalf_Int 13061.95 ms │ 849.76 µs │     25627032--- │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI13BandMatrixRowI39ClimaCore_Utilities_PlusHalf_Int 13262.84 ms │   1.85 ms │     25627064--- │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI13BandMatrixRowI39ClimaCore_Utilities_PlusHalf_Int 13464.69 ms │   3.33 ms │     25628450--- │ 
_Z22copyto_stencil_kernel_5FieldI5VIJFHI13BandMatrixRowI39ClimaCore_Utilities_PlusHalf_Int 13668.02 ms │   1.66 ms │     25628433--- │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI13BandMatrixRowI39ClimaCore_Utilities_PlusHalf_Int 13869.68 ms │   2.86 ms │     25628448--- │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI13BandMatrixRowI39ClimaCore_Utilities_PlusHalf_Int 14072.55 ms │   2.95 ms │     25628448--- │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI13BandMatrixRowILin1ELi3E10AxisTensorI7Float32Li2E 14275.51 ms │ 135.97 µs │       ----1.055 MiB │  7.575 GiB/s │ [set device memory]                                                                        14475.65 ms │  40.96 µs │       ----283.500 KiB │  6.601 GiB/s │ [set device memory]                                                                        14675.69 ms │   2.65 ms │     25627049--- │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI7Float32Li4E8SubArrayIS1_Li5E13CuDeviceArrayIS1_Li 14878.34 ms │   2.65 ms │     25627056--- │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI7Float32Li4E8SubArrayIS1_Li5E13CuDeviceArrayIS1_Li 15080.99 ms │   2.55 ms │     25628438--- │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI4_ 15483.54 ms │   2.36 ms │     640594--- │ _Z3_3515CuKernelContext13CuDeviceArrayI7Float32Li5ELi1EE11BroadcastedI12CuArrayStyleILi5E1 15885.91 ms │  652.6 µs │     640594--- │ _Z3_3515CuKernelContext13CuDeviceArrayI7Float32Li5ELi1EE11BroadcastedI12CuArrayStyleILi5E1 16086.61 ms │   1.04 ms │  4×4×16216×232--- │ _Z11knl_copyto_5VIJFHI13BandMatrixRowI39ClimaCore_Utilities_PlusHalf_Int64___1_Li2E10AxisT 16287.65 ms │   4.99 ms │     25628456--- │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI13BandMatrixRowILin1ELi3E10AxisTensorI7Float32Li2E 16492.65 ms │  482.4 µs │  4×4×16216×227--- │ _Z11knl_copyto_5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI6_1__2_EE6SArrayIS2_ 16693.13 ms │   3.01 ms │     25628440--- │ 
_Z22copyto_stencil_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI4_ 16896.14 ms │   1.65 ms │     2561447--- │ _Z28multiple_field_solve_kernel_10CUDADevice5TupleIS0_I5FieldI5VIJFHIS0_Li4E13CuDeviceArra 17097.79 ms │   2.38 ms │     25628440--- │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI4_ 172100.18 ms │   3.47 ms │     2561443--- │ _Z28multiple_field_solve_kernel_10CUDADevice5TupleIS0_I5FieldI5VIJFHIS0_I10AxisTensorI7Flo 174103.65 ms │ 847.55 µs │     25627040--- │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI7Float32Li4E8SubArrayIS1_Li5E13CuDeviceArrayIS1_Li 176104.5 ms │  845.4 µs │     25627040--- │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI7Float32Li4E8SubArrayIS1_Li5E13CuDeviceArrayIS1_Li 180105.35 ms │   1.85 ms │     768577--- │ _Z3_3515CuKernelContext13CuDeviceArrayI7Float32Li5ELi1EE11BroadcastedI12CuArrayStyleILi5E1 184107.2 ms │ 502.81 µs │     768577--- │ _Z3_3515CuKernelContext13CuDeviceArrayI7Float32Li5ELi1EE11BroadcastedI12CuArrayStyleILi5E1 188107.71 ms │   2.45 ms │     768579--- │ _Z3_3515CuKernelContext13CuDeviceArrayI7Float32Li5ELi1EE11BroadcastedI12CuArrayStyleILi5E1 192110.16 ms │ 766.97 µs │     768579--- │ _Z3_3515CuKernelContext13CuDeviceArrayI7Float32Li5ELi1EE11BroadcastedI12CuArrayStyleILi5E1 194110.96 ms │ 311.78 µs │  4×4×16216×229--- │ _Z11knl_copyto_5VIJFHI10NamedTupleI9__e_tot__5TupleI7Float32EELi4E13CuDeviceArrayIS2_Li5EL 196111.28 ms │   1.77 ms │     25628448--- │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI17ContravariantAxi 198113.05 ms │  81.82 µs │     4×421630--- │ _Z11knl_copyto_4IJFHI7Float32Li4E8SubArrayIS0_Li4E13CuDeviceArrayIS0_Li5ELi1EE5TupleI5Int6 200113.13 ms │  33.38 µs │     4×421623--- │ _Z9knl_fill_4IJFHI7Float32Li4E8SubArrayIS0_Li4E13CuDeviceArrayIS0_Li5ELi1EE5TupleI5Int645S 202113.17 ms │ 847.99 µs │     25627033--- │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI9_ 204114.02 ms │ 
536.83 µs │  4×4×16216×233--- │ _Z11knl_copyto_5VIJFHI10AxisTensorI7Float32Li1E5TupleI17ContravariantAxisI4_3__EE6SArrayIS 206114.56 ms │   2.19 ms │     25627072--- │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI7Float32Li4E13CuDeviceArrayIS1_Li5ELi1EEE16Placeho 208116.74 ms │ 558.04 µs │  4×4×16216×231--- │ _Z11knl_copyto_5VIJFHI8PhaseDryI7Float32ELi4E13CuDeviceArrayIS1_Li5ELi1EEE11BroadcastedI10 210117.31 ms │ 372.57 µs │  4×4×16216×232--- │ _Z11knl_copyto_5VIJFHI7Float32Li4E13CuDeviceArrayIS0_Li5ELi1EEE11BroadcastedI10VIJFHStyleI 212117.68 ms │ 378.53 µs │  4×4×16216×231--- │ _Z11knl_copyto_5VIJFHI7Float32Li4E13CuDeviceArrayIS0_Li5ELi1EEE11BroadcastedI10VIJFHStyleI 214118.06 ms │ 970.36 µs │     4×421658--- │ _Z11knl_copyto_4IJFHI10NamedTupleI73__ts___ustar___obukhov_length___buoyancy_flux_____flux 216119.04 ms │  138.5 µs │       ----1.055 MiB │  7.437 GiB/s │ [set device memory]                                                                        218119.19 ms │  39.58 µs │       ----283.500 KiB │  6.830 GiB/s │ [set device memory]                                                                        220119.23 ms │ 146.43 µs │       ----1.055 MiB │  7.034 GiB/s │ [set device memory]                                                                        222119.38 ms │  47.04 µs │       ----283.500 KiB │  5.748 GiB/s │ [set device memory]                                                                        224119.43 ms │   1.03 ms │  4×4×16216×2402.000 KiB static │           -- │ _Z23copyto_spectral_kernel_5FieldI5VIJFHI7Float32Li4E8SubArrayIS1_Li5E13CuDeviceArrayIS1_L 226120.47 ms │   1.12 ms │  4×4×16216×2402.000 KiB static │           -- │ _Z23copyto_spectral_kernel_5FieldI5VIJFHI7Float32Li4E8SubArrayIS1_Li5E13CuDeviceArrayIS1_L 228121.59 ms │ 971.93 µs │  4×4×16216×2392.000 KiB static │           -- │ _Z23copyto_spectral_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI6 230122.56 ms │   3.12 ms │  4×4×16216×2629.000 KiB static │          
 -- │ _Z23copyto_spectral_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI9 232125.69 ms │  902.2 µs │  4×4×16216×2403.000 KiB static │           -- │ _Z23copyto_spectral_kernel_5FieldI5VIJFHI7Float32Li4E13CuDeviceArrayIS1_Li5ELi1EEE16Placeh 236126.98 ms │   1.95 ms │     256203125--- │ _Z21dss_transform_kernel_13CuDeviceArrayI7Float32Li4ELi1EES_IS0_Li5ELi1EE8SubArrayIS0_Li5E 238128.93 ms │ 447.23 µs │     25615353--- │ _Z17dss_local_kernel_13CuDeviceArrayI7Float32Li4ELi1EES_I5TupleI5Int64S2_ELi1ELi1EES_IS2_L 240129.38 ms │   1.92 ms │     256203117--- │ _Z23dss_untransform_kernel_13CuDeviceArrayI7Float32Li4ELi1EES_IS0_Li5ELi1EE8SubArrayIS0_Li 242131.31 ms │ 519.48 µs │     256203125--- │ _Z21dss_transform_kernel_13CuDeviceArrayI7Float32Li4ELi1EES_IS0_Li5ELi1EE8SubArrayIS0_Li5E 244131.83 ms │ 162.66 µs │     2565153--- │ _Z17dss_local_kernel_13CuDeviceArrayI7Float32Li4ELi1EES_I5TupleI5Int64S2_ELi1ELi1EES_IS2_L 246131.99 ms │ 502.62 µs │     256203117--- │ _Z23dss_untransform_kernel_13CuDeviceArrayI7Float32Li4ELi1EES_IS0_Li5ELi1EE8SubArrayIS0_Li 248132.5 ms │   3.16 ms │  4×4×16216×2629.000 KiB static │           -- │ _Z23copyto_spectral_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI9 250135.66 ms │ 549.69 µs │  4×4×16216×232--- │ _Z11knl_copyto_5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI6_1__2_EE6SArrayIS2_ 252136.22 ms │   1.45 ms │     25628433--- │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI4_ 254137.67 ms │   1.23 ms │  4×4×16216×2623.000 KiB static │           -- │ _Z23copyto_spectral_kernel_5FieldI5VIJFHI7Float32Li4E8SubArrayIS1_Li5E13CuDeviceArrayIS1_L 256138.9 ms │ 427.36 µs │  4×4×16216×2312.000 KiB static │           -- │ _Z23copyto_spectral_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI17ContravariantAx 258139.33 ms │   1.12 ms │     25628433--- │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI17ContravariantAxi 260140.46 
ms │ 684.51 µs │  4×4×16216×2321024 bytes static │           -- │ _Z23copyto_spectral_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI17ContravariantAx 262141.17 ms │   7.13 ms │     25627078--- │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI6_ 264148.3 ms │   2.96 ms │     25628458--- │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI4_ 268151.26 ms │   2.34 ms │     640592--- │ _Z3_3515CuKernelContext13CuDeviceArrayI7Float32Li5ELi1EE11BroadcastedI12CuArrayStyleILi5E1 272153.61 ms │ 656.54 µs │     640592--- │ _Z3_3515CuKernelContext13CuDeviceArrayI7Float32Li5ELi1EE11BroadcastedI12CuArrayStyleILi5E1 276154.27 ms │   2.31 ms │     640592--- │ _Z3_3515CuKernelContext13CuDeviceArrayI7Float32Li5ELi1EE11BroadcastedI12CuArrayStyleILi5E1 280156.58 ms │ 640.96 µs │     640592--- │ _Z3_3515CuKernelContext13CuDeviceArrayI7Float32Li5ELi1EE11BroadcastedI12CuArrayStyleILi5E1 284157.22 ms │    1.9 ms │     768577--- │ _Z3_3515CuKernelContext13CuDeviceArrayI7Float32Li5ELi1EE11BroadcastedI12CuArrayStyleILi5E1 288159.13 ms │ 521.73 µs │     768577--- │ _Z3_3515CuKernelContext13CuDeviceArrayI7Float32Li5ELi1EE11BroadcastedI12CuArrayStyleILi5E1 292160.34 ms │   2.07 ms │     256203125--- │ _Z21dss_transform_kernel_13CuDeviceArrayI7Float32Li4ELi1EES_IS0_Li5ELi1EE8SubArrayIS0_Li5E 294162.41 ms │ 586.27 µs │     25620453--- │ _Z17dss_local_kernel_13CuDeviceArrayI7Float32Li4ELi1EES_I5TupleI5Int64S2_ELi1ELi1EES_IS2_L 296163.0 ms │   2.04 ms │     256203117--- │ _Z23dss_untransform_kernel_13CuDeviceArrayI7Float32Li4ELi1EES_IS0_Li5ELi1EE8SubArrayIS0_Li 298165.04 ms │ 548.77 µs │     256213125--- │ _Z21dss_transform_kernel_13CuDeviceArrayI7Float32Li4ELi1EES_IS0_Li5ELi1EE8SubArrayIS0_Li5E 300165.6 ms │ 172.19 µs │     2565453--- │ _Z17dss_local_kernel_13CuDeviceArrayI7Float32Li4ELi1EES_I5TupleI5Int64S2_ELi1ELi1EES_IS2_L 302165.77 ms │ 529.53 µs │     256213117--- │ 
_Z23dss_untransform_kernel_13CuDeviceArrayI7Float32Li4ELi1EES_IS0_Li5ELi1EE8SubArrayIS0_Li 306166.3 ms │   1.51 ms │    1024564--- │ _Z3_3515CuKernelContext13CuDeviceArrayI7Float32Li5ELi1EE11BroadcastedI12CuArrayStyleILi5E1 310167.81 ms │ 429.05 µs │    1024564--- │ _Z3_3515CuKernelContext13CuDeviceArrayI7Float32Li5ELi1EE11BroadcastedI12CuArrayStyleILi5E1 312168.24 ms │ 318.43 µs │  4×4×16216×229--- │ _Z11knl_copyto_5VIJFHI10NamedTupleI9__e_tot__5TupleI7Float32EELi4E13CuDeviceArrayIS2_Li5EL 314168.56 ms │   1.78 ms │     25628448--- │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI17ContravariantAxi 316170.34 ms │  82.21 µs │     4×421630--- │ _Z11knl_copyto_4IJFHI7Float32Li4E8SubArrayIS0_Li4E13CuDeviceArrayIS0_Li5ELi1EE5TupleI5Int6 318170.43 ms │  33.28 µs │     4×421623--- │ _Z9knl_fill_4IJFHI7Float32Li4E8SubArrayIS0_Li4E13CuDeviceArrayIS0_Li5ELi1EE5TupleI5Int645S 320170.46 ms │ 847.83 µs │     25627033--- │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI9_ 322171.31 ms │ 539.04 µs │  4×4×16216×233--- │ _Z11knl_copyto_5VIJFHI10AxisTensorI7Float32Li1E5TupleI17ContravariantAxisI4_3__EE6SArrayIS 324171.89 ms │   1.23 ms │     25627072--- │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI7Float32Li4E13CuDeviceArrayIS1_Li5ELi1EEE16Placeho 326173.12 ms │ 275.04 µs │  4×4×16216×231--- │ _Z11knl_copyto_5VIJFHI8PhaseDryI7Float32ELi4E13CuDeviceArrayIS1_Li5ELi1EEE11BroadcastedI10 328173.4 ms │ 136.83 µs │  4×4×16216×232--- │ _Z11knl_copyto_5VIJFHI7Float32Li4E13CuDeviceArrayIS0_Li5ELi1EEE11BroadcastedI10VIJFHStyleI 330173.54 ms │ 143.68 µs │  4×4×16216×231--- │ _Z11knl_copyto_5VIJFHI7Float32Li4E13CuDeviceArrayIS0_Li5ELi1EEE11BroadcastedI10VIJFHStyleI 332173.68 ms │ 798.97 µs │     4×421658--- │ _Z11knl_copyto_4IJFHI10NamedTupleI73__ts___ustar___obukhov_length___buoyancy_flux_____flux 334174.48 ms │  58.37 µs │  4×4×16216×221--- │ 
_Z11knl_copyto_5VIJFHI7Float32Li4E13CuDeviceArrayIS0_Li5ELi1EEE11BroadcastedI10VIJFHStyleI 336174.54 ms │ 443.39 µs │  4×4×16216×232--- │ _Z11knl_copyto_5VIJFHI13BandMatrixRowILin0ELi1E7AdjointI7Float3210AxisTensorIS2_Li1E5Tuple 338174.99 ms │ 366.53 µs │     25627048--- │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI13BandMatrixRowI39ClimaCore_Utilities_PlusHalf_Int 340175.36 ms │ 199.36 µs │     25628437--- │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI13BandMatrixRowI39ClimaCore_Utilities_PlusHalf_Int 342175.56 ms │ 295.55 µs │     25627048--- │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI13BandMatrixRowI39ClimaCore_Utilities_PlusHalf_Int 344175.85 ms │ 155.68 µs │     25627032--- │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI13BandMatrixRowI39ClimaCore_Utilities_PlusHalf_Int 346176.04 ms │ 408.89 µs │     25627064--- │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI13BandMatrixRowI39ClimaCore_Utilities_PlusHalf_Int 348176.45 ms │ 860.19 µs │     25628450--- │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI13BandMatrixRowI39ClimaCore_Utilities_PlusHalf_Int 350177.31 ms │ 152.42 µs │     25628433--- │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI13BandMatrixRowI39ClimaCore_Utilities_PlusHalf_Int 352177.46 ms │ 324.35 µs │     25628448--- │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI13BandMatrixRowI39ClimaCore_Utilities_PlusHalf_Int 354177.79 ms │ 384.45 µs │     25628448--- │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI13BandMatrixRowILin1ELi3E10AxisTensorI7Float32Li2E 356178.17 ms │  11.71 µs │       ----1.055 MiB │ 87.942 GiB/s │ [set device memory]                                                                        358178.18 ms │    4.1 µs │       ----283.500 KiB │ 66.008 GiB/s │ [set device memory]                                                                        360178.19 ms │ 245.92 µs │     25627049--- │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI7Float32Li4E8SubArrayIS1_Li5E13CuDeviceArrayIS1_Li 362178.44 ms │ 254.37 µs │     25627056--- │ 
_Z22copyto_stencil_kernel_5FieldI5VIJFHI7Float32Li4E8SubArrayIS1_Li5E13CuDeviceArrayIS1_Li 364178.69 ms │ 230.37 µs │     25628438--- │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI4_ 368178.92 ms │ 209.92 µs │     640594--- │ _Z3_3515CuKernelContext13CuDeviceArrayI7Float32Li5ELi1EE11BroadcastedI12CuArrayStyleILi5E1 372179.13 ms │  57.73 µs │     640594--- │ _Z3_3515CuKernelContext13CuDeviceArrayI7Float32Li5ELi1EE11BroadcastedI12CuArrayStyleILi5E1 374179.2 ms │ 124.25 µs │  4×4×16216×232--- │ _Z11knl_copyto_5VIJFHI13BandMatrixRowI39ClimaCore_Utilities_PlusHalf_Int64___1_Li2E10AxisT 376179.33 ms │ 755.26 µs │     25628456--- │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI13BandMatrixRowILin1ELi3E10AxisTensorI7Float32Li2E 378180.08 ms │  56.38 µs │  4×4×16216×227--- │ _Z11knl_copyto_5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI6_1__2_EE6SArrayIS2_ 380180.14 ms │ 274.69 µs │     25628440--- │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI4_ 382180.42 ms │ 163.26 µs │     2561447--- │ _Z28multiple_field_solve_kernel_10CUDADevice5TupleIS0_I5FieldI5VIJFHIS0_Li4E13CuDeviceArra 384180.58 ms │ 217.63 µs │     25628440--- │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI4_ 386180.8 ms │ 337.18 µs │     2561443--- │ _Z28multiple_field_solve_kernel_10CUDADevice5TupleIS0_I5FieldI5VIJFHIS0_I10AxisTensorI7Flo 388181.14 ms │  79.55 µs │     25627040--- │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI7Float32Li4E8SubArrayIS1_Li5E13CuDeviceArrayIS1_Li 390181.22 ms │  75.68 µs │     25627040--- │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI7Float32Li4E8SubArrayIS1_Li5E13CuDeviceArrayIS1_Li 394181.29 ms │ 166.94 µs │     768577--- │ _Z3_3515CuKernelContext13CuDeviceArrayI7Float32Li5ELi1EE11BroadcastedI12CuArrayStyleILi5E1 398181.46 ms │  45.02 µs │     768577--- │ _Z3_3515CuKernelContext13CuDeviceArrayI7Float32Li5ELi1EE11BroadcastedI12CuArrayStyleILi5E1 402181.51 ms 
│  238.4 µs │     768579--- │ _Z3_3515CuKernelContext13CuDeviceArrayI7Float32Li5ELi1EE11BroadcastedI12CuArrayStyleILi5E1 406181.74 ms │  71.68 µs │     768579--- │ _Z3_3515CuKernelContext13CuDeviceArrayI7Float32Li5ELi1EE11BroadcastedI12CuArrayStyleILi5E1 408181.82 ms │  35.14 µs │  4×4×16216×229--- │ _Z11knl_copyto_5VIJFHI10NamedTupleI9__e_tot__5TupleI7Float32EELi4E13CuDeviceArrayIS2_Li5EL 410181.86 ms │ 165.37 µs │     25628448--- │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI17ContravariantAxi 412182.02 ms │   9.25 µs │     4×421630--- │ _Z11knl_copyto_4IJFHI7Float32Li4E8SubArrayIS0_Li4E13CuDeviceArrayIS0_Li5ELi1EE5TupleI5Int6 414182.03 ms │   3.74 µs │     4×421623--- │ _Z9knl_fill_4IJFHI7Float32Li4E8SubArrayIS0_Li4E13CuDeviceArrayIS0_Li5ELi1EE5TupleI5Int645S 416182.04 ms │  78.72 µs │     25627033--- │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI9_ 418182.12 ms │  55.39 µs │  4×4×16216×233--- │ _Z11knl_copyto_5VIJFHI10AxisTensorI7Float32Li1E5TupleI17ContravariantAxisI4_3__EE6SArrayIS 420182.17 ms │ 205.06 µs │     25627072--- │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI7Float32Li4E13CuDeviceArrayIS1_Li5ELi1EEE16Placeho 422182.38 ms │  65.66 µs │  4×4×16216×231--- │ _Z11knl_copyto_5VIJFHI8PhaseDryI7Float32ELi4E13CuDeviceArrayIS1_Li5ELi1EEE11BroadcastedI10 424182.45 ms │   36.7 µs │  4×4×16216×232--- │ _Z11knl_copyto_5VIJFHI7Float32Li4E13CuDeviceArrayIS0_Li5ELi1EEE11BroadcastedI10VIJFHStyleI 426182.49 ms │  36.22 µs │  4×4×16216×231--- │ _Z11knl_copyto_5VIJFHI7Float32Li4E13CuDeviceArrayIS0_Li5ELi1EEE11BroadcastedI10VIJFHStyleI 428182.53 ms │ 120.22 µs │     4×421658--- │ _Z11knl_copyto_4IJFHI10NamedTupleI73__ts___ustar___obukhov_length___buoyancy_flux_____flux 430182.65 ms │  11.52 µs │       ----1.055 MiB │ 89.407 GiB/s │ [set device memory]                                                                        432182.68 ms │   4.29 µs │       ----283.500 KiB │ 63.051 GiB/s │ [set device 
memory]                                                                        434182.68 ms │  11.81 µs │       ----1.055 MiB │ 87.226 GiB/s │ [set device memory]                                                                        436182.69 ms │   4.06 µs │       ----283.500 KiB │ 66.526 GiB/s │ [set device memory]                                                                        438182.7 ms │ 103.29 µs │  4×4×16216×2402.000 KiB static │           -- │ _Z23copyto_spectral_kernel_5FieldI5VIJFHI7Float32Li4E8SubArrayIS1_Li5E13CuDeviceArrayIS1_L 440182.8 ms │ 108.38 µs │  4×4×16216×2402.000 KiB static │           -- │ _Z23copyto_spectral_kernel_5FieldI5VIJFHI7Float32Li4E8SubArrayIS1_Li5E13CuDeviceArrayIS1_L 442182.91 ms │  97.95 µs │  4×4×16216×2392.000 KiB static │           -- │ _Z23copyto_spectral_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI6 444183.01 ms │ 319.49 µs │  4×4×16216×2629.000 KiB static │           -- │ _Z23copyto_spectral_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI9 446183.33 ms │  89.41 µs │  4×4×16216×2403.000 KiB static │           -- │ _Z23copyto_spectral_kernel_5FieldI5VIJFHI7Float32Li4E13CuDeviceArrayIS1_Li5ELi1EEE16Placeh 450183.68 ms │  192.8 µs │     256203125--- │ _Z21dss_transform_kernel_13CuDeviceArrayI7Float32Li4ELi1EES_IS0_Li5ELi1EE8SubArrayIS0_Li5E 452183.88 ms │  51.46 µs │     25615353--- │ _Z17dss_local_kernel_13CuDeviceArrayI7Float32Li4ELi1EES_I5TupleI5Int64S2_ELi1ELi1EES_IS2_L 454183.93 ms │ 188.48 µs │     256203117--- │ _Z23dss_untransform_kernel_13CuDeviceArrayI7Float32Li4ELi1EES_IS0_Li5ELi1EE8SubArrayIS0_Li 456184.12 ms │  53.98 µs │     256203125--- │ _Z21dss_transform_kernel_13CuDeviceArrayI7Float32Li4ELi1EES_IS0_Li5ELi1EE8SubArrayIS0_Li5E 458184.17 ms │  17.54 µs │     2565153--- │ _Z17dss_local_kernel_13CuDeviceArrayI7Float32Li4ELi1EES_I5TupleI5Int64S2_ELi1ELi1EES_IS2_L 460184.19 ms │  49.98 µs │     256203117--- │ 
_Z23dss_untransform_kernel_13CuDeviceArrayI7Float32Li4ELi1EES_IS0_Li5ELi1EE8SubArrayIS0_Li 462184.24 ms │ 320.32 µs │  4×4×16216×2629.000 KiB static │           -- │ _Z23copyto_spectral_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI9 464184.56 ms │  65.92 µs │  4×4×16216×232--- │ _Z11knl_copyto_5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI6_1__2_EE6SArrayIS2_ 466184.63 ms │ 133.02 µs │     25628433--- │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI4_ 468184.76 ms │ 120.93 µs │  4×4×16216×2623.000 KiB static │           -- │ _Z23copyto_spectral_kernel_5FieldI5VIJFHI7Float32Li4E8SubArrayIS1_Li5E13CuDeviceArrayIS1_L 470184.88 ms │  43.39 µs │  4×4×16216×2312.000 KiB static │           -- │ _Z23copyto_spectral_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI17ContravariantAx 472184.93 ms │ 103.36 µs │     25628433--- │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI17ContravariantAxi 474185.03 ms │  77.57 µs │  4×4×16216×2321024 bytes static │           -- │ _Z23copyto_spectral_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI17ContravariantAx 476185.12 ms │   1.05 ms │     25627078--- │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI6_ 478186.18 ms │ 281.02 µs │     25628458--- │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI4_ 482186.46 ms │ 247.36 µs │     5125115--- │ _Z3_3515CuKernelContext13CuDeviceArrayI7Float32Li5ELi1EE11BroadcastedI12CuArrayStyleILi5E1 486186.71 ms │  70.02 µs │     5125115--- │ _Z3_3515CuKernelContext13CuDeviceArrayI7Float32Li5ELi1EE11BroadcastedI12CuArrayStyleILi5E1 490186.78 ms │ 240.89 µs │     5125115--- │ _Z3_3515CuKernelContext13CuDeviceArrayI7Float32Li5ELi1EE11BroadcastedI12CuArrayStyleILi5E1 494187.02 ms │  69.02 µs │     5125115--- │ _Z3_3515CuKernelContext13CuDeviceArrayI7Float32Li5ELi1EE11BroadcastedI12CuArrayStyleILi5E1 498187.09 ms │ 207.36 
µs │     640592--- │ _Z3_3515CuKernelContext13CuDeviceArrayI7Float32Li5ELi1EE11BroadcastedI12CuArrayStyleILi5E1 502187.3 ms │  58.14 µs │     640592--- │ _Z3_3515CuKernelContext13CuDeviceArrayI7Float32Li5ELi1EE11BroadcastedI12CuArrayStyleILi5E1 506187.55 ms │ 207.23 µs │     256203125--- │ _Z21dss_transform_kernel_13CuDeviceArrayI7Float32Li4ELi1EES_IS0_Li5ELi1EE8SubArrayIS0_Li5E 508187.76 ms │  66.34 µs │     25620453--- │ _Z17dss_local_kernel_13CuDeviceArrayI7Float32Li4ELi1EES_I5TupleI5Int64S2_ELi1ELi1EES_IS2_L 510187.82 ms │ 201.53 µs │     256203117--- │ _Z23dss_untransform_kernel_13CuDeviceArrayI7Float32Li4ELi1EES_IS0_Li5ELi1EE8SubArrayIS0_Li 512188.03 ms │  56.67 µs │     256213125--- │ _Z21dss_transform_kernel_13CuDeviceArrayI7Float32Li4ELi1EES_IS0_Li5ELi1EE8SubArrayIS0_Li5E 514188.08 ms │  18.05 µs │     2565453--- │ _Z17dss_local_kernel_13CuDeviceArrayI7Float32Li4ELi1EES_I5TupleI5Int64S2_ELi1ELi1EES_IS2_L 516188.1 ms │  52.35 µs │     256213117--- │ _Z23dss_untransform_kernel_13CuDeviceArrayI7Float32Li4ELi1EES_IS0_Li5ELi1EE8SubArrayIS0_Li 520188.16 ms │ 136.42 µs │    1024564--- │ _Z3_3515CuKernelContext13CuDeviceArrayI7Float32Li5ELi1EE11BroadcastedI12CuArrayStyleILi5E1 524188.29 ms │  38.14 µs │    1024564--- │ _Z3_3515CuKernelContext13CuDeviceArrayI7Float32Li5ELi1EE11BroadcastedI12CuArrayStyleILi5E1 526188.33 ms │  34.91 µs │  4×4×16216×229--- │ _Z11knl_copyto_5VIJFHI10NamedTupleI9__e_tot__5TupleI7Float32EELi4E13CuDeviceArrayIS2_Li5EL 528188.37 ms │ 166.53 µs │     25628448--- │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI17ContravariantAxi 530188.53 ms │   9.09 µs │     4×421630--- │ _Z11knl_copyto_4IJFHI7Float32Li4E8SubArrayIS0_Li4E13CuDeviceArrayIS0_Li5ELi1EE5TupleI5Int6 532188.54 ms │   3.71 µs │     4×421623--- │ _Z9knl_fill_4IJFHI7Float32Li4E8SubArrayIS0_Li4E13CuDeviceArrayIS0_Li5ELi1EE5TupleI5Int645S 534188.55 ms │  78.98 µs │     25627033--- │ 
_Z22copyto_stencil_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI9_ 536188.63 ms │  54.85 µs │  4×4×16216×233--- │ _Z11knl_copyto_5VIJFHI10AxisTensorI7Float32Li1E5TupleI17ContravariantAxisI4_3__EE6SArrayIS 538188.69 ms │ 205.82 µs │     25627072--- │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI7Float32Li4E13CuDeviceArrayIS1_Li5ELi1EEE16Placeho 540188.9 ms │  65.38 µs │  4×4×16216×231--- │ _Z11knl_copyto_5VIJFHI8PhaseDryI7Float32ELi4E13CuDeviceArrayIS1_Li5ELi1EEE11BroadcastedI10 542188.96 ms │  36.93 µs │  4×4×16216×232--- │ _Z11knl_copyto_5VIJFHI7Float32Li4E13CuDeviceArrayIS0_Li5ELi1EEE11BroadcastedI10VIJFHStyleI 544189.0 ms │  36.26 µs │  4×4×16216×231--- │ _Z11knl_copyto_5VIJFHI7Float32Li4E13CuDeviceArrayIS0_Li5ELi1EEE11BroadcastedI10VIJFHStyleI 546189.04 ms │ 120.96 µs │     4×421658--- │ _Z11knl_copyto_4IJFHI10NamedTupleI73__ts___ustar___obukhov_length___buoyancy_flux_____flux 548189.16 ms │  26.02 µs │  4×4×16216×221--- │ _Z11knl_copyto_5VIJFHI7Float32Li4E13CuDeviceArrayIS0_Li5ELi1EEE11BroadcastedI10VIJFHStyleI 550189.19 ms │   79.1 µs │  4×4×16216×232--- │ _Z11knl_copyto_5VIJFHI13BandMatrixRowILin0ELi1E7AdjointI7Float3210AxisTensorIS2_Li1E5Tuple 552189.26 ms │ 105.06 µs │     25627048--- │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI13BandMatrixRowI39ClimaCore_Utilities_PlusHalf_Int 554189.37 ms │ 102.17 µs │     25628437--- │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI13BandMatrixRowI39ClimaCore_Utilities_PlusHalf_Int 556189.47 ms │ 197.09 µs │     25627048--- │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI13BandMatrixRowI39ClimaCore_Utilities_PlusHalf_Int 558189.67 ms │  76.22 µs │     25627032--- │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI13BandMatrixRowI39ClimaCore_Utilities_PlusHalf_Int 560189.75 ms │ 174.69 µs │     25627064--- │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI13BandMatrixRowI39ClimaCore_Utilities_PlusHalf_Int 562189.93 ms │  501.5 µs │     25628450--- │ 
_Z22copyto_stencil_kernel_5FieldI5VIJFHI13BandMatrixRowI39ClimaCore_Utilities_PlusHalf_Int 564190.43 ms │ 153.44 µs │     25628433--- │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI13BandMatrixRowI39ClimaCore_Utilities_PlusHalf_Int 566190.59 ms │ 306.34 µs │     25628448--- │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI13BandMatrixRowI39ClimaCore_Utilities_PlusHalf_Int 568190.89 ms │ 383.81 µs │     25628448--- │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI13BandMatrixRowILin1ELi3E10AxisTensorI7Float32Li2E 570191.28 ms │  11.62 µs │       ----1.055 MiB │ 88.669 GiB/s │ [set device memory]                                                                        572191.29 ms │   4.03 µs │       ----283.500 KiB │ 67.053 GiB/s │ [set device memory]                                                                        574191.29 ms │ 246.11 µs │     25627049--- │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI7Float32Li4E8SubArrayIS1_Li5E13CuDeviceArrayIS1_Li 576191.54 ms │ 253.09 µs │     25627056--- │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI7Float32Li4E8SubArrayIS1_Li5E13CuDeviceArrayIS1_Li 578191.8 ms │ 230.02 µs │     25628438--- │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI4_ 582192.03 ms │ 209.38 µs │     640594--- │ _Z3_3515CuKernelContext13CuDeviceArrayI7Float32Li5ELi1EE11BroadcastedI12CuArrayStyleILi5E1 586192.24 ms │  58.05 µs │     640594--- │ _Z3_3515CuKernelContext13CuDeviceArrayI7Float32Li5ELi1EE11BroadcastedI12CuArrayStyleILi5E1 588192.3 ms │ 124.13 µs │  4×4×16216×232--- │ _Z11knl_copyto_5VIJFHI13BandMatrixRowI39ClimaCore_Utilities_PlusHalf_Int64___1_Li2E10AxisT 590192.43 ms │ 755.32 µs │     25628456--- │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI13BandMatrixRowILin1ELi3E10AxisTensorI7Float32Li2E 592193.19 ms │  56.38 µs │  4×4×16216×227--- │ _Z11knl_copyto_5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI6_1__2_EE6SArrayIS2_ 594193.24 ms │ 271.45 µs │     25628440--- │ 
_Z22copyto_stencil_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI4_ 596193.51 ms │ 161.37 µs │     2561447--- │ _Z28multiple_field_solve_kernel_10CUDADevice5TupleIS0_I5FieldI5VIJFHIS0_Li4E13CuDeviceArra 598193.68 ms │ 219.65 µs │     25628440--- │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI4_ 600193.9 ms │ 337.25 µs │     2561443--- │ _Z28multiple_field_solve_kernel_10CUDADevice5TupleIS0_I5FieldI5VIJFHIS0_I10AxisTensorI7Flo 602194.24 ms │  79.04 µs │     25627040--- │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI7Float32Li4E8SubArrayIS1_Li5E13CuDeviceArrayIS1_Li 604194.32 ms │  75.58 µs │     25627040--- │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI7Float32Li4E8SubArrayIS1_Li5E13CuDeviceArrayIS1_Li 608194.39 ms │ 167.13 µs │     768577--- │ _Z3_3515CuKernelContext13CuDeviceArrayI7Float32Li5ELi1EE11BroadcastedI12CuArrayStyleILi5E1 612194.56 ms │  44.96 µs │     768577--- │ _Z3_3515CuKernelContext13CuDeviceArrayI7Float32Li5ELi1EE11BroadcastedI12CuArrayStyleILi5E1 616194.6 ms │ 233.44 µs │     768579--- │ _Z3_3515CuKernelContext13CuDeviceArrayI7Float32Li5ELi1EE11BroadcastedI12CuArrayStyleILi5E1 620194.84 ms │  71.74 µs │     768579--- │ _Z3_3515CuKernelContext13CuDeviceArrayI7Float32Li5ELi1EE11BroadcastedI12CuArrayStyleILi5E1 622194.92 ms │  34.69 µs │  4×4×16216×229--- │ _Z11knl_copyto_5VIJFHI10NamedTupleI9__e_tot__5TupleI7Float32EELi4E13CuDeviceArrayIS2_Li5EL 624194.95 ms │  165.6 µs │     25628448--- │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI17ContravariantAxi 626195.12 ms │   8.96 µs │     4×421630--- │ _Z11knl_copyto_4IJFHI7Float32Li4E8SubArrayIS0_Li4E13CuDeviceArrayIS0_Li5ELi1EE5TupleI5Int6 628195.13 ms │   3.78 µs │     4×421623--- │ _Z9knl_fill_4IJFHI7Float32Li4E8SubArrayIS0_Li4E13CuDeviceArrayIS0_Li5ELi1EE5TupleI5Int645S 630195.13 ms │  78.56 µs │     25627033--- │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI9_ 632195.21 ms │ 
 55.46 µs │  4×4×16216×233--- │ _Z11knl_copyto_5VIJFHI10AxisTensorI7Float32Li1E5TupleI17ContravariantAxisI4_3__EE6SArrayIS 634195.27 ms │  205.6 µs │     25627072--- │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI7Float32Li4E13CuDeviceArrayIS1_Li5ELi1EEE16Placeho 636195.48 ms │  65.02 µs │  4×4×16216×231--- │ _Z11knl_copyto_5VIJFHI8PhaseDryI7Float32ELi4E13CuDeviceArrayIS1_Li5ELi1EEE11BroadcastedI10 638195.54 ms │  37.12 µs │  4×4×16216×232--- │ _Z11knl_copyto_5VIJFHI7Float32Li4E13CuDeviceArrayIS0_Li5ELi1EEE11BroadcastedI10VIJFHStyleI 640195.58 ms │  36.16 µs │  4×4×16216×231--- │ _Z11knl_copyto_5VIJFHI7Float32Li4E13CuDeviceArrayIS0_Li5ELi1EEE11BroadcastedI10VIJFHStyleI 642195.62 ms │ 120.25 µs │     4×421658--- │ _Z11knl_copyto_4IJFHI10NamedTupleI73__ts___ustar___obukhov_length___buoyancy_flux_____flux 644195.74 ms │  11.39 µs │       ----1.055 MiB │ 90.412 GiB/s │ [set device memory]                                                                        646195.77 ms │   4.13 µs │       ----283.500 KiB │ 65.494 GiB/s │ [set device memory]                                                                        648195.77 ms │  11.94 µs │       ----1.055 MiB │ 86.290 GiB/s │ [set device memory]                                                                        650195.79 ms │   4.16 µs │       ----283.500 KiB │ 64.992 GiB/s │ [set device memory]                                                                        652195.79 ms │ 102.97 µs │  4×4×16216×2402.000 KiB static │           -- │ _Z23copyto_spectral_kernel_5FieldI5VIJFHI7Float32Li4E8SubArrayIS1_Li5E13CuDeviceArrayIS1_L 654195.89 ms │ 108.61 µs │  4×4×16216×2402.000 KiB static │           -- │ _Z23copyto_spectral_kernel_5FieldI5VIJFHI7Float32Li4E8SubArrayIS1_Li5E13CuDeviceArrayIS1_L 656196.0 ms │  98.75 µs │  4×4×16216×2392.000 KiB static │           -- │ _Z23copyto_spectral_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI6 658196.1 ms │ 316.86 µs │  4×4×16216×2629.000 KiB static │           
-- │ _Z23copyto_spectral_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI9 660196.42 ms │  89.15 µs │  4×4×16216×2403.000 KiB static │           -- │ _Z23copyto_spectral_kernel_5FieldI5VIJFHI7Float32Li4E13CuDeviceArrayIS1_Li5ELi1EEE16Placeh 664196.77 ms │ 192.06 µs │     256203125--- │ _Z21dss_transform_kernel_13CuDeviceArrayI7Float32Li4ELi1EES_IS0_Li5ELi1EE8SubArrayIS0_Li5E 666196.96 ms │  50.56 µs │     25615353--- │ _Z17dss_local_kernel_13CuDeviceArrayI7Float32Li4ELi1EES_I5TupleI5Int64S2_ELi1ELi1EES_IS2_L 668197.01 ms │ 189.02 µs │     256203117--- │ _Z23dss_untransform_kernel_13CuDeviceArrayI7Float32Li4ELi1EES_IS0_Li5ELi1EE8SubArrayIS0_Li 670197.2 ms │  54.11 µs │     256203125--- │ _Z21dss_transform_kernel_13CuDeviceArrayI7Float32Li4ELi1EES_IS0_Li5ELi1EE8SubArrayIS0_Li5E 672197.26 ms │  17.79 µs │     2565153--- │ _Z17dss_local_kernel_13CuDeviceArrayI7Float32Li4ELi1EES_I5TupleI5Int64S2_ELi1ELi1EES_IS2_L 674197.27 ms │  49.63 µs │     256203117--- │ _Z23dss_untransform_kernel_13CuDeviceArrayI7Float32Li4ELi1EES_IS0_Li5ELi1EE8SubArrayIS0_Li 676197.32 ms │ 319.23 µs │  4×4×16216×2629.000 KiB static │           -- │ _Z23copyto_spectral_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI9 678197.64 ms │  65.47 µs │  4×4×16216×232--- │ _Z11knl_copyto_5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI6_1__2_EE6SArrayIS2_ 680197.71 ms │ 131.84 µs │     25628433--- │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI4_ 682197.84 ms │ 120.54 µs │  4×4×16216×2623.000 KiB static │           -- │ _Z23copyto_spectral_kernel_5FieldI5VIJFHI7Float32Li4E8SubArrayIS1_Li5E13CuDeviceArrayIS1_L 684197.96 ms │  43.07 µs │  4×4×16216×2312.000 KiB static │           -- │ _Z23copyto_spectral_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI17ContravariantAx 686198.01 ms │ 103.01 µs │     25628433--- │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI17ContravariantAxi 688198.11 
ms │  76.48 µs │  4×4×16216×2321024 bytes static │           -- │ _Z23copyto_spectral_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI17ContravariantAx 690198.2 ms │   1.05 ms │     25627078--- │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI6_ 692199.25 ms │  279.9 µs │     25628458--- │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI4_ 696199.53 ms │ 247.68 µs │     5125115--- │ _Z3_3515CuKernelContext13CuDeviceArrayI7Float32Li5ELi1EE11BroadcastedI12CuArrayStyleILi5E1 700199.78 ms │   70.5 µs │     5125115--- │ _Z3_3515CuKernelContext13CuDeviceArrayI7Float32Li5ELi1EE11BroadcastedI12CuArrayStyleILi5E1 704199.85 ms │ 136.38 µs │    1024564--- │ _Z3_3515CuKernelContext13CuDeviceArrayI7Float32Li5ELi1EE11BroadcastedI12CuArrayStyleILi5E1 708199.99 ms │  38.27 µs │    1024564--- │ _Z3_3515CuKernelContext13CuDeviceArrayI7Float32Li5ELi1EE11BroadcastedI12CuArrayStyleILi5E1 712200.03 ms │ 240.25 µs │     5125115--- │ _Z3_3515CuKernelContext13CuDeviceArrayI7Float32Li5ELi1EE11BroadcastedI12CuArrayStyleILi5E1 716200.27 ms │   68.0 µs │     5125115--- │ _Z3_3515CuKernelContext13CuDeviceArrayI7Float32Li5ELi1EE11BroadcastedI12CuArrayStyleILi5E1 720200.34 ms │ 240.42 µs │     5125115--- │ _Z3_3515CuKernelContext13CuDeviceArrayI7Float32Li5ELi1EE11BroadcastedI12CuArrayStyleILi5E1 724200.58 ms │  68.35 µs │     5125115--- │ _Z3_3515CuKernelContext13CuDeviceArrayI7Float32Li5ELi1EE11BroadcastedI12CuArrayStyleILi5E1 728200.82 ms │ 206.46 µs │     256203125--- │ _Z21dss_transform_kernel_13CuDeviceArrayI7Float32Li4ELi1EES_IS0_Li5ELi1EE8SubArrayIS0_Li5E 730201.03 ms │  66.46 µs │     25620453--- │ _Z17dss_local_kernel_13CuDeviceArrayI7Float32Li4ELi1EES_I5TupleI5Int64S2_ELi1ELi1EES_IS2_L 732201.1 ms │ 202.65 µs │     256203117--- │ _Z23dss_untransform_kernel_13CuDeviceArrayI7Float32Li4ELi1EES_IS0_Li5ELi1EE8SubArrayIS0_Li 734201.3 ms │  56.19 µs │     256213125--- │ 
_Z21dss_transform_kernel_13CuDeviceArrayI7Float32Li4ELi1EES_IS0_Li5ELi1EE8SubArrayIS0_Li5E
│ 736 │ 201.36 ms │  18.18 µs │     256 │     54 │   53 │                 - │           - │            - │ _Z17dss_local_kernel_13CuDeviceArrayI7Float32Li4ELi1EES_I5TupleI5Int64S2_ELi1ELi1EES_IS2_L
│ 738 │ 201.37 ms │  52.29 µs │     256 │    213 │  117 │                 - │           - │            - │ _Z23dss_untransform_kernel_13CuDeviceArrayI7Float32Li4ELi1EES_IS0_Li5ELi1EE8SubArrayIS0_Li
│ 740 │ 201.43 ms │  34.72 µs │  4×4×16 │  216×2 │   29 │                 - │           - │            - │ _Z11knl_copyto_5VIJFHI10NamedTupleI9__e_tot__5TupleI7Float32EELi4E13CuDeviceArrayIS2_Li5EL
│ 742 │ 201.46 ms │ 164.13 µs │     256 │    284 │   48 │                 - │           - │            - │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI17ContravariantAxi
│ 744 │ 201.63 ms │   9.19 µs │     4×4 │    216 │   30 │                 - │           - │            - │ _Z11knl_copyto_4IJFHI7Float32Li4E8SubArrayIS0_Li4E13CuDeviceArrayIS0_Li5ELi1EE5TupleI5Int6
│ 746 │ 201.64 ms │   3.84 µs │     4×4 │    216 │   23 │                 - │           - │            - │ _Z9knl_fill_4IJFHI7Float32Li4E8SubArrayIS0_Li4E13CuDeviceArrayIS0_Li5ELi1EE5TupleI5Int645S
│ 748 │ 201.64 ms │  78.88 µs │     256 │    270 │   33 │                 - │           - │            - │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI10AxisTensorI7Float32Li1E5TupleI13CovariantAxisI9_
│ 750 │ 201.72 ms │  55.14 µs │  4×4×16 │  216×2 │   33 │                 - │           - │            - │ _Z11knl_copyto_5VIJFHI10AxisTensorI7Float32Li1E5TupleI17ContravariantAxisI4_3__EE6SArrayIS
│ 752 │ 201.78 ms │ 205.53 µs │     256 │    270 │   72 │                 - │           - │            - │ _Z22copyto_stencil_kernel_5FieldI5VIJFHI7Float32Li4E13CuDeviceArrayIS1_Li5ELi1EEE16Placeho
│ 754 │ 201.99 ms │  65.82 µs │  4×4×16 │  216×2 │   31 │                 - │           - │            - │ _Z11knl_copyto_5VIJFHI8PhaseDryI7Float32ELi4E13CuDeviceArrayIS1_Li5ELi1EEE11BroadcastedI10
│ 756 │ 202.06 ms │   36.9 µs │  4×4×16 │  216×2 │   32 │                 - │           - │            - │ _Z11knl_copyto_5VIJFHI7Float32Li4E13CuDeviceArrayIS0_Li5ELi1EEE11BroadcastedI10VIJFHStyleI
│ 758 │  202.1 ms │  36.42 µs │  4×4×16 │  216×2 │   31 │                 - │           - │            - │ _Z11knl_copyto_5VIJFHI7Float32Li4E13CuDeviceArrayIS0_Li5ELi1EEE11BroadcastedI10VIJFHStyleI
│ 760 │ 202.14 ms │ 120.32 µs │     4×4 │    216 │   58 │                 - │           - │            - │ _Z11knl_copyto_4IJFHI10NamedTupleI73__ts___ustar___obukhov_length___buoyancy_flux_____flux
│ 762 │ 202.26 ms │   5.34 µs │     4×4 │    216 │   19 │                 - │           - │            - │ _Z11knl_copyto_4IJFHI7Float32Li4E13CuDeviceArrayIS0_Li4ELi1EEE11BroadcastedI9IJFHStyleILi4
│ 764 │ 202.26 ms │   4.86 µs │     4×4 │    216 │   21 │                 - │           - │            - │ _Z11knl_copyto_4IJFHI7Float32Li4E13CuDeviceArrayIS0_Li4ELi1EEE11BroadcastedI9IJFHStyleILi4
│ 766 │ 202.27 ms │  31.74 µs │  4×4×16 │  216×2 │   22 │                 - │           - │            - │ _Z11knl_copyto_5VIJFHI7Float32Li4E13CuDeviceArrayIS0_Li5ELi1EEE11BroadcastedI10VIJFHStyleI
│ 768 │  202.3 ms │  33.63 µs │  4×4×16 │  216×2 │   27 │                 - │           - │            - │ _Z11knl_copyto_5VIJFHI7Float32Li4E13CuDeviceArrayIS0_Li5ELi1EEE11BroadcastedI10VIJFHStyleI
│ 770 │ 202.34 ms │  41.18 µs │  4×4×16 │  216×2 │   32 │                 - │           - │            - │ _Z11knl_copyto_5VIJFHI7Float32Li4E13CuDeviceArrayIS0_Li5ELi1EEE11BroadcastedI10VIJFHStyleI
│ 772 │ 202.38 ms │  33.06 µs │  4×4×16 │  216×2 │   27 │                 - │           - │            - │ _Z11knl_copyto_5VIJFHI7Float32Li4E13CuDeviceArrayIS0_Li5ELi1EEE11BroadcastedI10VIJFHStyleI
│ 774 │ 202.41 ms │  31.23 µs │  4×4×16 │  216×2 │   22 │                 - │           - │            - │ _Z11knl_copyto_5VIJFHI7Float32Li4E13CuDeviceArrayIS0_Li5ELi1EEE11BroadcastedI10VIJFHStyleI
│ 776 │ 202.44 ms │   32.7 µs │  4×4×16 │  216×2 │   27 │                 - │           - │            - │ _Z11knl_copyto_5VIJFHI7Float32Li4E13CuDeviceArrayIS0_Li5ELi1EEE11BroadcastedI10VIJFHStyleI
│ 778 │ 202.48 ms │   29.6 µs │  4×4×16 │  216×2 │   22 │                 - │           - │            - │ _Z11knl_copyto_5VIJFHI7Float32Li4E13CuDeviceArrayIS0_Li5ELi1EEE11BroadcastedI10VIJFHStyleI
│ 780 │ 202.51 ms │  32.67 µs │  4×4×16 │  216×2 │   27 │                 - │           - │            - │ _Z11knl_copyto_5VIJFHI7Float32Li4E13CuDeviceArrayIS0_Li5ELi1EEE11BroadcastedI10VIJFHStyleI
│ 782 │ 202.55 ms │  30.18 µs │  4×4×16 │  216×2 │   28 │                 - │           - │            - │ _Z11knl_copyto_5VIJFHI7Float32Li4E13CuDeviceArrayIS0_Li5ELi1EEE11BroadcastedI10VIJFHStyleI
│ 784 │ 202.58 ms │  32.96 µs │  4×4×16 │  216×2 │   27 │                 - │           - │            - │ _Z11knl_copyto_5VIJFHI7Float32Li4E13CuDeviceArrayIS0_Li5ELi1EEE11BroadcastedI10VIJFHStyleI
│ 786 │ 202.61 ms │  49.92 µs │  4×4×16 │  216×2 │   32 │                 - │           - │            - │ _Z11knl_copyto_5VIJFHI7Float32Li4E13CuDeviceArrayIS0_Li5ELi1EEE11BroadcastedI10VIJFHStyleI
│ 788 │ 202.66 ms │  32.67 µs │  4×4×16 │  216×2 │   27 │                 - │           - │            - │ _Z11knl_copyto_5VIJFHI7Float32Li4E13CuDeviceArrayIS0_Li5ELi1EEE11BroadcastedI10VIJFHStyleI
│ 790 │  202.7 ms │  50.69 µs │  4×4×16 │  216×2 │   32 │                 - │           - │            - │ _Z11knl_copyto_5VIJFHI7Float32Li4E13CuDeviceArrayIS0_Li5ELi1EEE11BroadcastedI10VIJFHStyleI
│ 792 │ 202.75 ms │  32.96 µs │  4×4×16 │  216×2 │   27 │                 - │           - │            - │ _Z11knl_copyto_5VIJFHI7Float32Li4E13CuDeviceArrayIS0_Li5ELi1EEE11BroadcastedI10VIJFHStyleI
│ 794 │ 202.78 ms │   49.6 µs │  4×4×16 │  216×2 │   32 │                 - │           - │            - │ _Z11knl_copyto_5VIJFHI7Float32Li4E13CuDeviceArrayIS0_Li5ELi1EEE11BroadcastedI10VIJFHStyleI
│ 796 │ 202.83 ms │  33.06 µs │  4×4×16 │  216×2 │   27 │                 - │           - │            - │ _Z11knl_copyto_5VIJFHI7Float32Li4E13CuDeviceArrayIS0_Li5ELi1EEE11BroadcastedI10VIJFHStyleI
│ 798 │ 202.88 ms │   5.47 µs │     4×4 │    216 │   22 │                 - │           - │            - │ _Z11knl_copyto_4IJFHI7Float32Li4E13CuDeviceArrayIS0_Li4ELi1EEE11BroadcastedI9IJFHStyleILi4
│ 800 │ 202.88 ms │   4.51 µs │     4×4 │    216 │   21 │                 - │           - │            - │ _Z11knl_copyto_4IJFHI7Float32Li4E13CuDeviceArrayIS0_Li4ELi1EEE11BroadcastedI9IJFHStyleILi4
│ 802 │ 202.89 ms │   10.3 µs │     4×4 │    216 │   19 │                 - │           - │            - │ _Z11knl_copyto_4IJFHI7Float32Li4E13CuDeviceArrayIS0_Li4ELi1EEE11BroadcastedI9IJFHStyleILi4
│ 804 │  202.9 ms │   5.44 µs │     4×4 │    216 │   21 │                 - │           - │            - │ _Z11knl_copyto_4IJFHI7Float32Li4E13CuDeviceArrayIS0_Li4ELi1EEE11BroadcastedI9IJFHStyleILi4
│ 806 │  202.9 ms │   3.65 µs │     4×4 │    216 │   19 │                 - │           - │            - │ _Z11knl_copyto_4IJFHI7Float32Li4E13CuDeviceArrayIS0_Li4ELi1EEE11BroadcastedI9IJFHStyleILi4
│ 808 │ 202.91 ms │   5.47 µs │     4×4 │    216 │   21 │                 - │           - │            - │ _Z11knl_copyto_4IJFHI7Float32Li4E13CuDeviceArrayIS0_Li4ELi1EEE11BroadcastedI9IJFHStyleILi4
└─────┴───────────┴───────────┴─────────┴────────┴──────┴───────────────────┴─────────────┴──────────────┴─────────────────────────────────────────────────────────────────────────────────────────────
                                                                                                                                                                                       1 column omitted


[ Info: (#)x entries have been multiplied by corresponding factors in order to compute percentages
┌─────────────────────┬────────────┬──────────┬───────────┬────────────┬───────────┬───────────┬───────────┬──────────────────┐
│ Function            │     Memory │   allocs │      Time │       Time │      Time │      Time │ N-samples │ step! percentage │
│                     │   estimate │ estimate │       min │        max │      mean │    median │           │                  │
├─────────────────────┼────────────┼──────────┼───────────┼────────────┼───────────┼───────────┼───────────┼──────────────────┤
│ Wfact (3x)          │  35.58 KiB │     1104 │  6.830 ms │  57.416 ms │ 38.465 ms │ 56.796 ms │        10 │          17.5461 │
│ ldiv! (3x)          │  23.81 KiB │     1020 │ 14.086 ms │  57.116 ms │ 52.566 ms │ 56.787 ms │        10 │          36.1898 │
│ T_imp! (3x)         │  16.31 KiB │      864 │ 14.184 ms │  24.836 ms │ 23.616 ms │ 24.653 ms │        10 │          36.4403 │
│ T_exp_T_lim! (4x)   │  51.44 KiB │     1776 │ 15.526 ms │ 131.409 ms │ 35.957 ms │ 15.799 ms │        10 │          39.8879 │
│ lim! (4x)           │  128 bytes │        8 │ 34.800 μs │  38.000 μs │ 35.640 μs │ 35.000 μs │        10 │        0.0894071 │
│ dss! (4x)           │  27.12 KiB │     1256 │ 25.866 ms │  26.830 ms │ 26.155 ms │ 25.979 ms │        10 │          66.4551 │
│ post_explicit! (3x) │  26.09 KiB │     1083 │ 25.018 ms │  25.415 ms │ 25.145 ms │ 25.100 ms │        10 │          64.2752 │
│ post_implicit! (4x) │  34.78 KiB │     1444 │ 32.992 ms │  33.870 ms │ 33.236 ms │ 33.179 ms │        10 │           84.762 │
│ step! (1x)          │ 278.23 KiB │     3962 │ 38.923 ms │ 352.252 ms │ 89.016 ms │ 45.617 ms │        10 │            100.0 │
└─────────────────────┴────────────┴──────────┴───────────┴────────────┴───────────┴───────────┴───────────┴──────────────────┘
Test Summary:              |Time
Benchmark allocation tests | None  0.0s
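
Judging from the printed numbers alone, the `step! percentage` column seems to be each row's minimum time divided by the minimum time of `step!`, times 100. This is inferred from the output, not taken from the extension's source. A minimal Julia sketch that recomputes the column from the rounded values in the summary table; `step_percentage` is a hypothetical helper and not part of `CTS`:

```julia
# Hypothetical helper (not part of CTS): share of the step! minimum time
# taken by one row, in percent.
step_percentage(t_min_ms, step_min_ms) = 100 * t_min_ms / step_min_ms

# Minimum times in milliseconds, copied from the summary table (rounded there).
step_min_ms = 38.923
min_times_ms = [
    "Wfact"          => 6.830,
    "ldiv!"          => 14.086,
    "T_imp!"         => 14.184,
    "T_exp_T_lim!"   => 15.526,
    "lim!"           => 0.0348,   # 34.800 μs converted to ms
    "dss!"           => 25.866,
    "post_explicit!" => 25.018,
    "post_implicit!" => 32.992,
]

# Recompute the percentage for every row.
percentages = [name => step_percentage(t, step_min_ms) for (name, t) in min_times_ms]

for (name, pct) in percentages
    println(rpad(name, 16), round(pct; digits = 2), " %")
end
```

The recomputed values agree with the printed column only to about two decimal places, because the table shows times rounded to three decimals.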

@charleskawczynski charleskawczynski merged commit 83cf6b6 into main May 3, 2024
5 of 6 checks passed
@charleskawczynski charleskawczynski deleted the ck/benchmark branch May 3, 2024 14:06